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Abstract —The Message Passing Interface (MPI) is the prevalent programming model used on today’s supercomputers. Therefore, MPI 
library developers are looking for the best possible performance (shortest run-time) of individual MPI functions across many different 
supercomputer architectures. Several MPI benchmark suites have been developed to assess the performance of MPI implementations. 
Unfortunately, the outcome of these benchmarks is often neither reproducible nor statistically sound. To overcome these issues, we show 
which experimental factors have an impact on the run-time of blocking collective MPI operations and how to control them. We address the 
problem of process and clock synchronization in MPI benchmarks. Finally, we present a new experimental method that allows us to obtain 
reproducible and statistically sound MPI measurements. 
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^ Introduction 

r'Snice the Message Passing Interface (MPI) was standardized 
the 1990s, it has been the prevalent programming model on 
majority of supercomputers. As MPI is an essential build- 
jj^g block of high-performance applications, performance 
^oblems in the MPI library have direct consequences on the 
overall run-time of applications, 
ly-'j Library developers and algorithm designers have one 
J^estion in common: which algorithm works better (is faster) 
'^r a given communication problem? For example, which 
Complementation of broadcast is faster on p = 128 processors 
t'^ing a payload of to = 64 Bytes? As today's parallel systems 
C^n hardly be modeled analytically, empirical evaluations 
nising run-time tests of MPI functions are required to compare 
^■C^ifferent MPI implementations. It is therefore important to 
jTPeasure the run-time of MPI functions correctly. 

I MPI library developers rely on benchmark suites to test 
•their implementations. The problem is that the results of 
_ ^ese benchmarks may vary significantly, where Table 1 
k^ows one example. The table compares the minimum and 
5 ^aximum run-time of an MPl_Bcast on 16 nodes that were 
'^ported by 30 different calls (mpiruns) to the Intel MPI 
Benchmarks. The third column lists the difference between 
the minimum and maximum run-time in percent. We can 
see that for payloads of up to 512 Bytes, the run-times have 
an error of roughly 10 %. One solution might be to change 
the default parameters of the Intel MPI Benchmarks. For 
example, one could force the benchmark to perform more 
measurements. But the question then becomes: how many 
runs are sufficient to obtain reproducible results? 

It is a common practice—when comparing MPI imple¬ 
mentations as part of a scientific publication—to choose one 
of the available benchmarks and compare the results. Many 
MPI benchmarks report either the mean, the median, or the 
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Table 1 

Minimum and maximum run-time of MPi_Bcast obtained from 30 
different runs of the Intel MPI Benchmarks on 16 x 1 processes with 
NEC MPI f .2.8 on TUWien. 


Msg. size 
[Bytes] 

min(avg) 

[/isj 

max(avg) 

[psj 

diff 

[%] 

0 

0.04 

0.04 

0.00 

1 

2.93 

3.12 

6.09 

2 

2.83 

3.20 

11.56 

4 

2.82 

3.06 

7.84 

8 

2.86 

3.13 

8.63 

16 

2.84 

3.14 

9.55 

32 

3.13 

3.44 

9.01 

64 

3.15 

3.51 

10.26 

128 

3.62 

4.03 

10.17 

256 

4.34 

4.90 

11.43 

512 

5.41 

5.91 

8.46 

1024 

6.88 

7.05 

2.41 

2048 

9.52 

9.71 

1.96 

4096 

13.52 

13.91 

2.80 

8192 

19.30 

19.66 

1.83 

16384 

32.21 

33.13 

2.78 

32768 

53.57 

54.47 

1.65 


minimum run-time. The problem is that without using a 
dispersion metric and a rigorous statistical analysis, we can 
hardly determine whether an observation is repeatable or 
the result of chance. 

In this article, we make the following contributions to 
the problem of accurately benchmarking blocking collective MPI 
operations: 

1) We show that a precise synchronization of clocks is 
key to measure MPI functions accurately. 

2) We present a novel clock synchronization algorithm 
that has two advantages: (1) it is accurate as it 
accounts for the clock drift between processes, and 
(2) provides a good trade-off between time and 
accuracy. 

3) We establish a list of (experimental) factors that, we 
show, do significantly influence the outcome of MPI 
performance measurements. 

4) We propose a novel benchmarking method, includ- 
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Table 2 

Comparison of measures for several MPI benchmark suites. 


benchmark name 

ref. 

version 

mean 

min 

max 

measure of dispersion 

Intel MPI Benchmarks 

[1] 

4.0.0 

/ 

/ 

/ 

X 

MPIBench 

[2] 

l.Obeta 

/ 

/ 


sub-sampled data 

MPIBlib 

[3] 

1.2.0 

/ 



conf. interv. of the mean (default 95%) 

mpicroscope 

[4] 

1.0 

/ 

/ 

/ 

X 

mpptest 

[5] 

1.5 

1 min of means 


X 

NBCBench 

[6] 

1.1 

/ (+median) 


/ 

X 

OSU Micro-Benchmarks 

[7] 

4.4.1 

/ 

/ 

/ 

X 

Phloem MPI Benchmarks 

[8] 

1.0.0 

/ 

/ 

/ 

X 

SKaMPl 

[9] 

5.0.4 

/ 



std. error 


ing an experimental design and an appropriate data 
analysis strategy, that allows for a fair comparison 
of MPI libraries, but which is most importantly 
(1) statistically sound and (2) reproducible. 

We start by summarizing other MPI benchmarking 
approaches in Section 2 and discuss their strengths and 
shortcomings. Section 3 introduces our general experimental 
framework that we use for all experiments conducted as 
part of this article. In Section 4, we show the practical 
implications of applying different process synchronization 
methods and also present our novel clock synchronization 
algorithm. Section 5 takes a closer look at factors that may 
influence the experimental outcome (the run-time of an MPI 
function). Section 6 describes our method for comparing the 
performance of MPI libraries in a statistically sound manner. 
We summarize related work in Section 7 with an emphasis 
on statistically sound experiments, before we conclude in 
Section 8. 

2 A Brief History OF MPI Benchmarking 

We now give a history of MPI benchmarking. Ever since 
the first MPI standard was announced in 1995, several MPI 
benchmark suites have been proposed. Some of the best- 
known MPI benchmark suites are summarized in Table 2. 
The table includes information about the measures (e.g., min, 
max) that each benchmark uses to present run-times and 
which measure of dispersion is provided. It is complemented 
with the run-time measurement approaches implemented by 
each benchmark, which are separately summarized in the 
four pseudocode listings in Table 3. Furthermore, Table 4 
details the methods selected by each of the investigated 
benchmarks for computing and presenting the measured 
run-times. The data in Table 2 was gathered to the best of 
our knowledge, since some benchmarks, like the Special 
Karlsruher MPl-Benchmark (SKaMPl), have been released in 
many incarnations and some other ones, like the MPIBench, 
are currently not available for download^. We therefore also 
rely on the respective articles describing the benchmarks. 

mpptest was one of the first MPI benchmarks [5] 
and was a part of the MPICH distribution. Gropp and 
Lusk carefully designed mpptest to allow reproducible 
measurements for realistic usage scenarios. They pointed 
out common pitfalls when conducting MPI performance 
measurements, such as ignoring cache effects. In particular, 

1. We obtained the source code of MPIBench l.Obeta through private 
communication. 


to ensure cold caches when sending a message, mpptest 
uses a send and a receive buffer which are twice as big as 
the cache level that should be "cold". Then, the starting 
address of a message to be sent is always advanced in this 
larger buffer, trying to ensure that the data accessed are not 
cached. If a starting address does not leave enough space 
for the message to be sent, it is reset to the beginning of 
the buffer. At the time of designing mpptest, most of the 
hardware clocks were coarse-grained and therefore did not 
allow a precise measurement of one call to a specific MPI 
function (as this would have often resulted in obtaining a 0). 
To overcome this problem and to improve the reproducibility 
of results, mpptest measures the time f of nrep consecutive 
calls to an MPI function and computes the mean = f /nrep 
of these nrep observations. This measurement is repeated k 
times and the minimum over these k samples is reported, 
i.e., mini<j<fc ti. 

The SKaMPl benchmark is a highly configurable MPI 
benchmark suite [9] and features a domain-specific language 
for describing individual MPI benchmark tests. SKaMPl also 
allows to record MPI timings by using a window-based pro¬ 
cess synchronization approach, in addition to the commonly 
used MPl_Barrier (cf. measurement schemes (MSI) and 
(MS4) in Table 3). SKaMPl reports the arithmetic mean and 
the standard error of the run-times of MPI functions. It uses 
an iterative measuring process for each test case, where a test 
is repeated until the current relative standard error is smaller 
than a predefined maximum. 

MPIBlib by Lastovetsky et al. [3] works similarly to 
SKaMPl, as it computes a confidence interval of the mean 
based on the current sample. It stops the measurements 
when the sample mean is within a predefined range (e.g., a 
5 % difference) from the end of a 95% confidence interval. 
MPIBlib implements multiple methods for computing the 
sample mean, as shown in schemes (PS2) and (PS4) in Table 4. 
It also provides an additional scheme that measures the run¬ 
time on the root process only, but which we omitted for 
reasons of clarity. 

mpicroscope [4] and OSU Micro-Benchmarks [7] per¬ 
form repeated measurements of one specific MPI func¬ 
tion for a predefined number of times. They report the 
minimum, the maximum, and the mean run-time of a 
sample, mpicroscope attempts to reduce the number of 
measurements using a linear (or optionally exponential) 
decay of repetitions, i.e., if no new minimum execution time 
in a sample of nrep consecutive MPI function calls was found, 
the remaining number of repetitions is decreased. 

The Intel MPI Benchmarks [1] use a measurement method 
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Table 3 

Measurement schemes (MS) for blocking MPI collectives found in different MPI benchmarks. In scheme (MS4), depending on the implementation, 
Get_Time returns the local time (measured with MPi_wtime or rdtsc) or a logical global time. 


(MSI) SKaMPl, NBCBench, 
MPIBlib, MPlBench, 
mpicroscope, 

OSU Micro-Benchmarks 

(MS2) Intel MPI Benchmarks, 
mpptest 

(MS3) Phloem MPI Benchmarks 

(MS4) SKaMPl, NBCBench 

1: for obs in 1 to nrep do 

2: MPI_Barrier 

3: T-""Tobs] = MPI_Wtlme 

4: execute MPI function 

5: r-"”Tots] = MPI Wtlme 

6: t += l‘-“’"lobs] - 

H OSti Micro-Benchmarks 

1: MPl_Barrier //or omitted 

2: s_hmc = MPI_Wtime 

3: for obs in 1 to nrep do 

4: execute MPI function 

5: e_time = MPl_V}time 

6: t = {e_time — s_time) jnrep 

1: MPI_Barrier 

2: s_hmc = MPI_Wtime 

3: for obs in 1 to nrep do 

4: execute MPI function 

5: MPI_Barrier // or omitted 

6: = MPI_Wtime 

7: t = {e_time — s_time)/nrep 

1; Sync ClocksO 

2; Decide on start_time and win_size 

3; for obs in 1 to nrep do 

4; Wait UNTiL(sfflrf time + obs ■ win size) 

5: T-"”Tote] = Get_Time() 

6: execute MPI function 

7: r-‘'^‘^[obs] - Get_Time() 


Table 4 

Commonly used data processing schemes (PS) for benchmarking blocking collective MPI operations. 


(PSD SKaMPl 

(PS2) MPIBlib, mpicroscope 

(PS3) OSU 
Micro-Benchmarks 

(PS4) MPIBlib 

1: for obs in 1 to nrep do 

2: ('-“[ofcs] = r-'”'[ote] - (“-"“’[ote] 

3: Reduce nrep local run-times from 
each process on root: 

4: SORT(/"") 

5: l‘= i”“[ nrep/4 : (nrep - nrep/4)] 

6: print MEAN„„p/ 2 (i')r STDEV„„p/ 2 (i') 

1: for obs in 1 to nrep do 

2: = (■'-'“Tots] - i*-'“Tots] 

3: Reduce nrep local run-times from 
each process on root: 

(“Tots] = MAXp(il,-'°“'[ots]) 

4: print MEAN«rep(( ), CI„,ep{( ) 

//MPIBlib 

print MEAN„„p(i“*), MIN„„p((“*), 

MAX„,ep{(“'') //mpicroscope 

1: local_time = t/nrep 

2: Reduce local_time from 
each process to root: 
minjat = MlNp{local_time) 
maxjat = MAXp{local_fime) 
meanjat = S\J'Mp{local_time) 

/ nrep 

3: print meanjat, minjat, 
maxjat 

1: Normalize_Times((''-""”') 

2: for obs in 1 to nrep do 

3: Reduce (*'“'’‘''[ots] from each 

process to root: 

(““■^']ots] = MAXp((f'”'[ots]) 

4: ('[ots] = (“'‘■^']ots] - (“-"“Tots] 

5: print MEAN„rq,(('), CI„,q,((') 

(PS5) MPlBench 

(PS6) NBCBench 

(PS7) Intel MPI Benchmarks, 
Phloem MPI Benchmarks 

(PS8) mpptest 

1: NORMALIZE_TIMES(/“-'‘"'') 

2: NORMALIZE_TIMES(/'-'‘”') 

3: Gather from each pro¬ 

cess on mot into i-'JT 

4: for rank in 1 to p do 

5: for obs in 1 to nrep do 

6: i = {rank — 1) • nrep -j- obs 

8: l‘= REMOVE_OUTLIERSp.„,q,(t^„,) 

9: print MIN(i'), MAX(;'), MEAN(/') 

1: for obs in 1 to nrep do 

2: ]ots] = (■'-"“Tots] - (“-'“Tots] 

3: f = OP(('-'“°'') 

//OP G {min, max, mean, median} 

4: Gather t from each process on roof 
into 

5: MAXp(/0 

6: print 

1: Gather average times t 
on root process into 

2: print MlNp(Z^), MAXp(/^), 
MEANp(Z') 

1: Broadcast t from the root pro¬ 
cess 

2: collect MiN^epsCO over several 
repetitions of the measurement 
scheme 


similar to mpptest, i.e., the time is taken before and after 
executing nrep consecutive calls to an MPI function. Then, 
the benchmark computes the mean of the run-times over 
these nrep consecutive calls for each MPI rank. The final 
report includes the minimum, maximum, and average of 
these means across all ranks. 

The Phloem MPI Benchmarks [8] for MPI collectives 
measure the total time to execute nrep consecutive MPI 
function calls and compute the mean run-time for each rank. 
In addition, the Phloem MPI Benchmarks can be configured 
to interleave the evaluated MPI function calls with calls 
to MPl_Barrier in each iteration. Minimum, maximum, 
and average run-times across MPI ranks are provided upon 
benchmark completion. 

Grove and Coddington developed MPlBench [2], which, 
in addition to mean and minimum run-times, also plots 
a sub-sample of the raw data to show the dispersion of 
measurements. They discuss the problem of outlier detection 
and removal. In their work, the run-times that are bigger 


than some threshold time Uhresh are treated as outliers. To 
compute tthresh, they determine the 99th percentile of the 
sample, denoted as fgg, and then define tthresh = tgg ■ a for 
some constant a > 1 (default a = 2). Grove also shows 
the distribution of run-times obtained when measuring 
MPl_lsend with different message sizes [10, p. 127]. He 
highlights the fact that the execution time of MPI functions 
is not normally distributed. 

NBCBench was initially introduced to assess the run-time 
of non-blocking collective implementations in comparison to 
their blocking alternatives [6]. Later, Hoefler et al. explained 
how blocking and non-blocking collective MPI operations 
could be measured scalably and accurately [11]. The authors 
show that calling MPI functions consecutively can lead to 
pipelining effects, which could distort the results. To address 
these problems, they implement a window-based synchro¬ 
nization scheme, requiring 0{\ogp) rounds to complete, 
compared to the 0{p) rounds needed by SKaMPl, where 
p denotes the number of processes. 
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Algorithm 1 MPI timing procedure. 


1: procedure Time_MPI_function(/wmc, msize, nrep) 
fl fimc - MPI function 
I I msize - message size 
I I nrep - nb. of observations 
2: initialize time array f with nrep elements 

3: for obs in 1 to nrep do 

4: Sync_PrOCESSES() // either MPI_Barrier or window-based sync. 

5: = Get_Time() 

6: execute/wnc (msize) 

7: [obs] = Get_Time() 


8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 


if sync method == MPI_Barrier then 
for obs in 1 to nrep do 

/'-'““'[obs] = [obs] - /“-"'"“-'““'[obs] 

MP1_ReduCE(/'-'““', /', nrep, MPI_MAX, root) 

else 

normalize gig^al reference clock 

MPLReduCE]/®-'""'-'"'"', nrep, MPI_DOUBLE, MPI_MIN, root) 

MP1_REDUCE(/“-"'"'-'“'“', MPI_DOUBLE, MPI_MAX, root) 

for obs in 1 to nrep do 
Ifobs] - 

if my_rank == root then 
for obs in 1 to nrep do 
print l^[obs] 


3 Experimental Setup 

As the results of this paper heavily rely on the empirical 
analysis of hypotheses, we first introduce our measurement 
scheme for blocking, collective MPI functions, the data 
processing methods applied, and the parallel machines used 
for conducting our experiments. 


3.2.1 Completion Time based on Local Times 

In this case, each MPI process holds an array containing 
nrep local time measurements (run-times to complete a given 
MPI function). We apply a reduction operation (max) on 
that array and collect the results on the root process. Thus, 
the run-time of an MPI function func using p processes in 
iteration i, 0 < i < nrep, is given as T[i] = maxo<r<p{^*[i]}. 
In other words, the run-time of an MPI function is defined 
as the maximum local run-time over all processes. This 
run-time computation procedure is typically applied for 
measurements where processes are synchronized using 
MPI_Barrier. 

3.2.2 Completion Time based on Global Times 

When globally-synchronized clocks are available, we define 
the time to complete an MPI operation as the difference 
between the maximum finishing time and the minimum 
starting time among all processes. All nrep starting and 
finishing timestamps from all processes are gathered as 
vectors on the root node. Then, the root node computes the 
time of an MPI function /mmc using p processes in iteration i 
like this T[{\ = maxo<,<p{;''-‘'""'[i]} - mino<,<p{("-''’"‘'[z]}. We 
use this method to compute the completion time in all our 
experiments in which we employ a clock synchronization 
method. 


3.1 Notation 

The benchmarks NBCBench and Netgauge are related. For 
example, Hoefler et al. state the following: "We used our 
new findings to implement a new benchmark scheme in the 
benchmark suite Netgauge. The implementation bases on 
NBCBench [..]" [12]. For that reason, we use NBCBench to 
refer to the MPI benchmark and Netgauge to the algorithm 
that synchronizes clocks hierarchically. 

We use the following notation in the remainder of 
the article, which we borrowed from Kshemkalyani and 
Singhal [13]. The clock offset is the difference between the 
time reported by two clocks. The skew of the clock is the 
difference in the frequencies of two clocks and the clock drift 
is the difference between two clocks over a period of time. 

3.2 Timing Procedure 

In the experiments presented in this article, we measure the 
time for completing a single MPI function using the method 
shown in Algorithm 1. Before the start of a benchmark run, 
the experimenter chooses the number of observations nrep 
(sample size) to be recorded for an individual test, where 
a test consists of an MPI function, a message size, and a 
number of processes. Before starting to measure the run-time 
of an MPI function, all processes need to be synchronized. We 
examine two kinds of synchronization approaches: (1) the 
use of MPl_Barrier and (2) the window-based synchro¬ 
nization scheme. Advantages and disadvantages of each 
synchronization method will be discussed in more detail in 
Section 4. 

Depending on the type of synchronization, we use 
different ways to compute the time to complete a collective 
MPI operation, as detailed in Algorithm 1 (lines 8-17). 


3.3 Window-based Process Synchronization 

SKaMPI [14] was (to the best of our knowledge) the first MPI 
benchmark suite that used a window-based synchronization 
strategy to measure the run-time of MPI functions. Its 
window-based synchronization method works as follows: 
(1) The distributed clocks of all participating MPI processes 
are synchronized relative to a reference clock. To this end, 
each MPI process computes its clock offset relative to a master 
process (e.g., process 0) to be able to normalize its time to 
the master's reference clock. (2) The master process selects 
a start time, which is a point in time that lies in the future, 
and broadcasts this start time to all participating processes. 
(3) Since each process knows the time difference to the master 
process, all processes are now able to wait for this start time 
before executing the respective MPI function synchronously. 
When one MPI function call has been completed, all processes 
will wait for another future point in time before starting the 
next measurement. The time period between these distinct 
points is called a "window". 

This synchronization method shown in scheme (MS4) of 
Table 3 is used for all benchmarking experiments that rely 
on window-based process synchronization in this paper. 

3.4 High-Resolution Time Measurements 

Hoefler et al. discussed the problem that the resolution 
of MP l_wt ime is typically not high enough for measuring 
short time intervals [12]. They therefore use the CPU's clock 
register to count the number of processor cycles since reset. 
More specifically, Netgauge implements a time measurement 
mechanism based on the atomic RDTSC instruction, which 
provides access to the TSC register and which is supported by 
the x86 and x86-64 instruction set architectures (ISA). How¬ 
ever, several problems can arise when using this mechanism. 
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First, Hoefler et al. point out that dynamic frequency changes, 
which are automatically enabled in modern processors, 
can modify the CPU clock rate and thus compromise the 
time measurements. Second, in multi-processor systems, 
CPU clocks are not necessarily synchronized, requiring the 
processes to be pinned to cores to guarantee valid cycle 
counter values. 

This RDTSC mechanism is vulnerable to out-of-order 
instruction executions supported by most modern proces¬ 
sors [15], [16]. To overcome this issue, we performed our 
measurements using the equivalent RDTSCP call, which 
guarantees instruction serialization, that is, it makes sure 
that all instructions have been executed when the timestamp 
counter is read. Unless otherwise specified, we fixed the 
frequency to the highest available value and pinned each 
process to a specific core in all our experiments involving 
RDTSCP-based time measurements. 

As all our experimental platforms are Linux systems, we 
checked the TSC-related flags provided by / proc/cpuinf o. 
On all our systems, the flags constant_tsc and 
nonstop_tsc were set, indicating that the speed of up¬ 
dating the TSC register is independent of the current core 
frequency. Nevertheless, we need to make sure that processes 
are pinned to cores throughout the measurements to avoid 
accidentally reading the TSC register of another core. 

3.5 Data Processing 

Most of the benchmarks listed in Table 2 use some form 
of implicit outlier removal (e.g., taking the minimum time 
recorded). In addition, many benchmarks perform a number 
of warm-up rounds to fill caches or to set up communication 
links. After the initial warm-up phase has completed, the 
measurements taken are used to compute the final statistics. 
One problem is that the operating system noise can lead to 
relatively large variations of the measured run-time at any 
moment within the benchmark execution. A second problem 
is that it is hard to estimate how many warm-up rounds are 
sufficient to reach a “steady state". To make our benchmark 
method robust against these two problems, we use Tukey's 
outlier filter to remove outliers after all measurements have 
been recorded [17, p. 126]. When applying this filter, we 
remove all measurements from the sample that are either 
smaller than Qi — 1.5 • IQR or larger than + 1.5 • IQR. 
IQR denotes the interquartile range between quartiles Qi 
and Qs. 

3.6 Parallel Machines 

The parallel machines used for conducting our experiments 
are summarized in Table 5. On the TUWien system, we have 
dedicated access to the entire cluster. The Edel (G5k) system 
belongs to Grid'5000^, which features the OAR job scheduler 
that allows us to gain exclusive access to a set of nodes 
connected to the same InfiniBand (IB) switch. On VSC-1, 
VSC-3, and Cartesius (SURFsara), we also made sure that our 
allocations include dedicated nodes only. However, we have 
no dedicated access to the switches as in the case of the other 
two machines. 

2. http://www.grid5000.fr 


4 MPI Process Synchronization Revisited 

Now, we turn our attention to the problem of synchronizing 
MPI processes, and its implication on performance results. 
A commonly employed synchronization method for MPI 
processes is the use of the MPl_Barrier call. The problem 
is that the completion of MPl_Barrier only ensures that 
all processes have called this function, but processes can 
still leave the barrier skewed in time. Depending on the 
amount of skew and the actual test case, the run-time of 
subsequent MPI calls can fluctuate considerably. For instance, 
a process skew of 5 ps introduced by calling MPl_Barrier 
has more effect on an MPI_Bcast with a message size 
of 8 Bytes than it has on an MPl_Bcast with SKiBytes. 
Hoefler et al. also point out that a call to MPl_Barrier 
can influence the collective operation being benchmarked 
when both operations use the same network [11]. 

In order to prevent such problems with MPl_Barrier, 
several MPI benchmarks use a window-based process syn¬ 
chronization scheme to ensure that all processes start calling 
a given MPI function at the same time. 

We will take a closer look at both synchronization 
methods and discuss their advantages and disadvantages. 
Then, we will investigate the synchronization quality of 
MPl_Barrier. Afterwards, we propose a novel clock 
synchronization method, which combines features of sev¬ 
eral competitors, namely (1) learning and applying a 
model of the clock drift and (2) optionally, synchronization 
in 0(logp) steps. Finally, we will examine whether the 
window-based synchronization methods are competitive 
with MPI_Barrier. 

4.1 MPI Process Synchronization in the Wild 

The clock synchronization algorithm used in SKaMPI is 
similar to Cristian's algorithm [18] and works as follows: one 
of the p MPI processes is selected as the master process with 
which all other processes will synchronize. Then, a number 
of ping-pong messages is exchanged between process pairs, 
and the processes' local time is piggy-backed on a ping- 
pong message. Using this approach, the master process 
can determine the clock differences between itself and each 
of the other processes. The clock offset computed at the 
master process is later broadcast to the others, which allows 
all processes to compute a logical global time. The logical 
global clock is used to synchronize processes, as shown in 
Algorithms 7 and 8. 

A major drawback of SKaMPTs approach is that it re¬ 
quires linear time to synchronize distributed clocks. To speed 
up the clock synchronization, Hoefler et al. implemented 
a more scalable method, which only requires a logarithmic 
number of steps to complete (depending on the number of 
processes). It is used in both NBCBench and Netgauge [12]. 
Their method outlined in Algorithms 11, 12, and 13 works 
as follows: the set of processes is divided into two groups: 
one group (GROUP 1) contains all processes up to the largest 
rank that is a power of two, and a second group (GROUP 2) 
contains the remaining processes. The clock synchronization 
is done in two phases. In the first one, all processes in 
GROUP 1 synchronize their clocks in a tree-like fashion in 
logp rounds. In the second phase, each remaining process 



Table 5 

Overview of parallel machines used in the experiments. 
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name 

nodes 

interconnect 

MPI libraries 

compilers 

TUWien 

36 Dual Opteron 6134 @2.3 GHz 

IB QDR MT26428 

NEC MPI/LX 1.2.11 
NEC MPI 1.2.8 
MVAPICH 1.9 
MVAPICH 2.1a 

gcc 4.4.7 

VSC-1 

436 Dual Xeon 5550 @2.66 GHz 

IB QLogic 12200 

Intel MPI 4.1 

gcc 4.4.7 

VSC-3 

2000 Dual Xeon E5-2650V2 @2.6 GHz 

IB QDR-80 

Intel MPI 5 
MVAPICH 2.0a-qlc 

gcc 4.4.7 

Edel (G5k) 

72 Dual Xeon E5520 @ 2.27 GHz 

IB QDR MT26428 

MVAPICH 1.9 

gcc 4.7.2 

Cartesius (SURFsara) 

64 Dual Xeon E5-2450V2 @2.5 GHz 

IB Mellanox ConnectX-3 FDR 

Intel MPI 4.1 

gcc 4.4.7 


of GROUP 2 synchronizes its clock with one distinct partner 
process from GROUP 1 in one additional round. 

Algorithm 7 shows the pseudocode of SKaMPl's method 
to determine the clock offsets between two processes. 
SKaMPl synchronizes the clock of each process with the 
clock of the root node in 0{p) steps (cf. procedure COM- 
pute_And_Set_Clock_Offsets in Algorithm 8). Hoe- 
fler et al. showed how the time to compute these clock 
offsets can be reduced by using a tree-like synchronization 
process [12] (see Algorithms 11 and 12). 

The benchmarking methods proposed in SKaMPl and 
NBCBench rely on a periodic re-adjustment of the window 
size to cope with run-times that are too large to fit into 
the synchronization window. Further, they implement a 
minimal re-synchronization mechanism that broadcasts a 
new point in time used for synchronizing processes for each 
new experiment (e.g., for a different message size to be 
benchmarked). However, neither SKaMPl nor NBCBench 
implement a re-synchronization of the distributed clocks to 
counterbalance the clock drift, cf. Algorithms 10 and 14. 

Jones and Koenig proposed a clock synchronization 
method that takes the clock drift between distributed pro¬ 
cesses into account [19]. Their method is based on the 
assumption that the clock drift is Linear in time. Each process 
learns a linear model of the clock drift by exchanging 
ping-pong messages with a single reference process. After 
computing the linear model of the clock drift (using a linear 
regression), each process can determine a logical global 
time by adjusting its local time relative to the time of the 
reference process. In the ping-pong phase, local times are 
exchanged between each process and the reference process. 
When processes receive a local timestamp from the reference 
process, some time has already passed, which is the time 
for transferring the timestamp message. To account for this 
delay, the received timestamps are corrected by half of the 
mean round-trip time (RTT). Jones and Koenig do not further 
detail how the RTT is obtained, even though the estimation 
of the RTT could be a source of error. 

4.2 Sources of Error for Clock Synchronization 

In Section 3.3, we have introduced the window-based 
synchronization scheme to measure the run-time of one 
MPI function call. As this scheme requires synchronized, dis¬ 
tributed clocks, we now show some pitfalls when applying 
this scheme. 

In the context of MPI benchmarking, Hoefler et al. have 
shown that two processor clocks are linearly drifting over 



-400--^-,-,-^- ,- r- 

0 10 20 30 40 50 

time at process 0 [s] 


Figure 1. Clock drift between a reference host and six other hosts on 
TUWien (Exp. details: Appendix C.1). 

time [12]. We re-conducted their experiment to examine 
the clock drift on our current machines, but using a finer 
resolution than what was done in [12] (we only measure 
in the range of seconds instead of hours). Figure 1 shows 
that the maximum clock drift between two hosts of our 
cluster is about 700ps (|—400ps|-F300ps) after 50s. Thus, 
not accounting for the clock drift will lead to highly in¬ 
accurate window-based measurements, with a drift error 
in the range of microseconds after only a few seconds of 
conducting measurements. Hence, a window-based scheme 
must precisely deal with both clock offset and clock drift. 

4.2.1 The Error of the Frequency Estimation 

As mentioned in Section 3.4, Netgauge supports reading the 
TSC registers to obtain fine-grained timers. In order to con¬ 
vert elapsed clock ticks into seconds, the update frequency of 
the TSC registers is required. Netgauge applies the following 
method to estimate this frequency: the estimation routine 
sleeps for a fixed amount of time, in case of Netgauge for 1 s, 
and measures the number of ticks elapsed in this period. The 
process is repeated until a minimum number of ticks (given 
a search time threshold) has been recorded for the fixed-size 
sleeping period. In this way, Netgauge is able to estimate the 
update frequency of the TSC register. 

We were interested in evaluating how accurate Net- 
gauge's method for estimating the CPU frequency is, and 
whether it would be beneficial to use it for our benchmarking 
purposes. We conducted an experiment (see Algorithm 19), 
in which we called Netgauge's frequency estimation macro 
(hrt_calibrate) 100 times on 16 different nodes (each 
clocked at 2.3 GHz), and recorded the estimated frequency 
on each core. Figure 2 shows the frequency variation obtained 
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Figure 2. Differences of clock speeds obtained on 16 nodes (processes) 
for 100 calls to hrt_calibrate of Netgauge (TUWien, Exp. details: 
Appendix C.2). 


o Calibrated frequency 
□ Fixed frequency 


1 2 3 4 5 6.7 8 

time since synchronization [s] 


Figure 3. Mean clock drift and 95 % confidence intervals for the Netgauge 
synchronization (16 x 1 processes, 10 calls to mplrun, MVAPICFI 2.1a, 
TUWien, Exp. details: Appendix C.3). 


on each of the 16 nodes. The results indicate that for most 
nodes the estimated frequency variation lies in the range of 
10 kHz to 20 kHz, which at first glance suggests that such an 
estimation method would be reproducible. However, if we 
analyze the error of this frequency estimation, we obtain a 
variation of roughly 10 kHz for all cores (which translates 
to 10“® GHz). That means, if we assume that a processor 
runs at a fixed clock frequency of 2.3 GHz, then the error is 
R:! 4.3 • 10“®. Consequently, applying this frequency 
estimation method in the context of MPI benchmarking 
results in an inherent timing error of 1 ps per second. 

We also examined whether this error could be reduced 
when using a fixed clock frequency to convert clock ticks 
obtained from RDTSC/RDTSCP into seconds. We therefore 
compared the clock drift of Netgauge using its original 
frequency estimation to the clock drift with a fixed frequency 
(here we fixed it to 2.3 GHz). Figure 3 compares the mean 
clock drifts over 10 mpirun calls, which were measured 
for both frequency estimation methods (cf. Algorithm 20). 
Surprisingly, the clock drifts are significantly different de¬ 
pending on whether we estimate the clock frequency or not. 
According to these results, the clock drift after 10 s using 
Netgauge's frequency estimation is almost 10 times bigger 
than when we use a fixed frequency. As a consequence, we 
have decided to use a fixed clock frequency when converting 
the clock ticks (obtained from RDTSC) into seconds for our 
window-based MPI benchmarking scheme. 



Figure 4. Drifting run-times of MPi_Bcast (mean + Cl) using Net- 
gauge and SKaMPI synchronization methods compared to the results 
obtained with MPi_Barrier (8192 Bytes, 32 X 16 processes, 4000 mea¬ 
surements, 1 call to mpirun, MVAPICFI 2.1a, TUWien, Exp. details: 
Appendix C.4). 


4.2.2 The Error of Accounting for the Clock Offset only 

Now, we examine how accurately the clock synchronization 
schemes of Netgauge and SKaMPI work in practice. We 
designed an experiment that measures the individual run¬ 
times of 4000 consecutive calls to MPl_Bcast using 512 pro¬ 
cesses (distributed over 32 compute nodes). The process 
synchronization between calls to MPl_Bcast is either done 
by using an MP l_Barrier call or by applying the window- 
based schemes implemented in Netgauge or SKaMPI with 
a fixed window size. Figure 4 shows the development of 
the mean run-time of MPl_Bcast over time. For presenta¬ 
tion purposes, we binned every group of 100 consecutive, 
individual measurements and only plotted the bin means 
and their confidence intervals. As expected, the run-time 
of MPl_Bcast stays relatively stable when synchronizing 
with MPl_Barrier (as this process synchronization method 
is independent of the clock). However, the binned run¬ 
times increase over time when a window-based scheme is 
applied. The underlying problem is that neither Netgauge 
nor SKaMPI consider the clock drift when synchronizing 
clocks. Instead, both benchmarks "only" determine the clock 
offset between processes. 

In conclusion, we contend that clock synchronization 
schemes need to consider the clock drift when computing 
the logical global clock. 

4.3 Accounting for the Clock Drift and Offset 

Jones and Koenig propose a method to synchronize dis¬ 
tributed clocks while considering the clock drift and the clock 
offset [19]. Their goal was to have accurately synchronized 
clocks for implementing co-scheduling algorithms of parallel 
applications and tracing functionality in the MPI domain. 

Algorithm 15 presents the clock synchronization method 
proposed by Jones and Koenig. The idea is to learn a linear 
model of the clock drift between a process and the reference 
process. The algorithm of Jones and Koenig relies on two 
parameters. N_FITPTS specifies how many points {afitpoint is 
a tuple containing a reference clock timestamp and a clock 
offset) will be recorded as input for the linear regression 
analysis. N_EXCHANGES denotes the number of ping-pong 
messages exchanged between a pair of processes to obtain 

















Algorithm 2 HCA clock synchronization. 

p - number of processes 
r - current process rank (0 to p — 1) 

Im - linear model of the current process (defined by a slope 
and an intercept) to adjust the local clock to the reference 
time of root 

jmodei _ of p linear models 
^ Qj H reference clock 

hierarchical_intercepts - if defined, compute intercepts 
hierarchically (instead of directly between each r and root) 
start_time - next window start time, updated after each sync 
initial_time - local timestamp used to adjust the local clock to 
the time 0 of the synchronization start 
maxpower — 

1: procedure Sync_Clocks(n_fitpts, n_exchanges) 

2: initial_time — Get_Time() 

// compute linear models of each clock's drift relative to root 
3: Sync_Clocks_Pow2(n_fitpts, n_exchanges) 

4: Sync_Clocks_Remaining(n_fitpts, n_exchanges) 

// send final linear models from root to each process 
5: MPI_SCATTER(Z'”®‘^^', 1, T_PAIR_DOUBLE, /m, 1, T_PAIR_DOUBLE, root) 

#Ifndef hierarchical_intercepts 
6: COMPUTE_AND_SET_ALL_lNTERCEPTS(/m) 

#EndIfndef 
7: MPI_Barrier() 

8: start _time — Get_Adjusted_Time() + win_size 

9: MFl_BCAST{start_time, 1, MPl_DOUBLE, root) 


a single fitpoint. The entire method works as follows: a 
process exchanges N_EXCHANGES ping-pong messages with 
the master process, where each message carries the local 
time of a process. The remote process computes the time 
difference between its local time and the local time of the 
reference process. Since these measured clock offsets have a 
relatively large variance on many networks, the process only 
keeps one of these (n_EXCHANGES many) time differences by 
selecting the median value. Then, this ping-pong exchange 
phase between a process and the master (reference) process 
is repeated N_EITPTS times. Finally, each remote process 
computes its model of the clock drift relative to the master 
process by fitting a linear regression model to the recorded 
clock offsets (time differences). 

In addition, since each of the measured clock offsets needs 
to be corrected by rff/2 to account for the network latency, 
we present our method for estimating the RTT between two 
processes in Algorithm 17. 

Once the clock drift model is established, each measured 
local time can be normalized to the reference time of the 
master process by applying a drift correction, as shown in 
Algorithm 16. 

4.4 HCA Algorithm for Clock Synchronization 


Algorithm 3 Hierarchical linear models of the clock drift. 

1: function Get_Adjusted_Time 
2: return Get_Time() - initial_time 


3: procedure Sync_Clocks_POw2(n_fitpts, n_exchanges) 

// compute linear models of the clock drifts for processes with 
H indices between 0 and (maxpower — l) 

4: round — 1 

5: if r > maxpower then return 

6: while < maxpower do 

7: if (r mod 2™“”'^) == 0 then Hprocess with reference clock 

8: client - r+ 

9: rtt — COMPUTE_RTT(r, client) 

10; Learn_Model_HCA(n_fitpts, n_exchanges, rtt, r, client) 

#Ifdef hierarchicaljntercepts 

11; Compute_And_Set_Intercept(null, client, r) 

#EndIfdef 


12 

13 

14 

15 

16 

17 

18 
19 


20 ; 


// receive linear models collected by the client 

MPI_RECv(/2,^f, 2''^'"'^'“^, t_pair_double, client) 

[client] - [0] // save client model 

for z in 1 to (2''*^“"'^“^ — 1) do //compute resulting models 

r°^^fclient + z] - [client], /SlfW) 

else if (r mod = 2''"“'"^-^ then //client 

p_ref^ r - 2™""“^-! 
rtt = COMPUTE_RTT(p_re/, r) 

l'^°^^^[r] = Learn_Model_HCA(n_fitpts, N_EXCHANGES, 
rtt, p_ref r) 

#Ifdef hierarchicaljntercepts 

COMPUTE_AND_SET_lNTERCEPT(Z'”°‘^^'[r], r, p_ref) 

#EndIfdef 


// send all new linear models to the reference process 
21; MPI_SEND(/'”‘’‘^^'[r], 2''^“”^'“^, T_PAIR_DOUBLE, p_ref) 

22: round — round + 1 


23; procedure Sync_Clocks_Remaining(n_fitpts, n_exchanges) 

// compute linear models of the clock drifts for processes with 
// indices between maxpower and (p — 1} 

24; if maxpower == p then return 

25; if r < p — maxpower then //process with reference clock 

26; client = r + maxpower 

27; rtt — COMPUTE_RTT(r, client) 

28; Learn_Model_HCA(n_fitpts, n_exchanges, rtt, r, client) 

#Ifdef hierarchicaljntercepts 

29; Compute_And_Set_Intercept(null, client, r) 

#EndIfdef 

30; else if r > maxpower then //client 

31; P-^^f — r — maxpower 

32; rtt — COMPUTE_RTT(p_rc/, r) 

33; Im — Learn_Model_HCA(n_fitpts, n_exchanges, 

rtt, p_ref, r) 

#Ifdef hierarchicaljntercepts 

34; COMPUTE_AND_SET_lNTERCEPT(/m, r, p_ref) 

#EndIfdef 

35; subjcomm — create communicator comprising process ranks 
(0, maxpower, maxpower + 1, . . . , p — 1) 

36; if r == root then 

37; MPI_GATHER(/m, 1, t_pair_double, 

38; tmpjm, 1, t_pair_double, root, sub_comm) 

39; for j in 0 to (p — maxpower — 1) do 

40; q — maxpower + j 

41: /”“*'[<;] = MERGE_LMS(r”*'[i], tmpjm[j + 1]) 

42; else if r > maxpower then 

43; MPI_GATHER(/m, 1, t_pair_double, 

44; tmpjm, 1, T_PAlR_DOUBLE, root, sub_comm) 


We want to combine the advantages of the synchronization 
algorithm developed by Jones and Koenig (JK) with the 
synchronization scheme applied by Netgauge. We propose a 
novel algorithm that synchronizes distributed clocks in a hier¬ 
archical way, but also takes the clock drift into account. Jones 
and Koenig already noted in their article that they had chosen 
a 0{p) scheme for better accuracy, "whereas a balanced 
0(\ogp) scheme may complete in milliseconds with higher 
variance (owing to the multiple reference stratums)" [19]. 
We still want to explore the possibility of applying such a 
0(\ogp) scheme to improve the scalability of the original 
algorithm of Jones and Koenig. 

Algorithm 3 shows the pseudocode of our novel algo¬ 
rithm called HCA (for Hunold and Carpen-Amarie). The 


45: procedure COMPUTE_AND_SET_ALL_lNTERCEPTS(/m) 
//compute intercepts for the model Im of the current process r 
46; if r root then 

47; COMPUTE_AND_SET_lNTERCEPT(/m, r, root) 

48; else 

49; for z in 0 to (p — 1) s.t. z root do 

50; COMPUTE_AND_SET_lNTERCEPT(/m, i, root) 


computational structure of HCA works similarly to the 
algorithm described by Hoefler et al. [12]. The difference, 
however, is that instead of determining only the clock offset 
of each process relative to the root, HCA computes a linear 
model of the clock drift of each process to obtain a global 
clock. 

The algorithm synchronizes clocks in two steps. In the 











9 


Algorithm 4 Clock drift model for a pair of processes. 

1: function Learn_Model_Hca(n_fitpts, n_exchanges, rtt, pi, p2) 

2: slope — 0, intercept — 0 

3: if my_ran/:——then //process with reference clock 

4: for idx in 0 to N_EITPTS — 1 do 

5: for i in 0 to N_EXCEIANGES — 1 do 

6: MPI_RECV(trfummi/, 1, MPI_DOUBLE, p2) 

7: tremote = Get_Adjusted_Time() 

8: MPI_SEND(fremote, 1, MPI_DOUBLE, p2) 

9: else if my_rank —— p2 then // client process 

10: for idx in 0 to N_FITPTS — 1 do 

11: for i in 0 to N_EXCHANGES — 1 do 

12: MPI_SEND(fd«mmy, 1, MPI_DOUBLE, pi) 

13: MPI_RECV(fremote, 1, MPI_DOUBLE, pi) 

14: localJimes[i] = Get_Adjusted_Time() 

15: — local _times[i] — tremote — rtt/2 

16: = SORT(Z‘'®') 

17: yfit[i<lx] = COMPUTE_MEDIAN(i*y) 

18: idxjmedian — i s.t. (0 < i < N_EXCHANGES &c 

== yfit[idx]) 

19: xfit[idx] — local_times[idx_median] 

20: slope, intercept — LiNEAR_¥n{xfit, yfit, n_fitpts) 

21: return NEW_LM(s/ope, intercept) 

22: procedure COMPUTE_AND_SET_lNTERCEPT(/m, client, p_ref) 

// compute the intercept using the SKaMPI method 
23: if r == client then 

24: dijf = SKAMPI_PlNGPONG(cfenf, 

25: diffjimestamp — Get_Adjusted_Time() 

26: Im.intercept — Im.slope ■ {—dijf_timestamp) + dijf 

27: else if r == p_ref then 

28: diff = SKAMPl_PlNQPONQ{cUent, p_ref) 

29: function MERGE_LMs(/mI, ImZ) 

30: nezvjm.intercept — Iml.intercept + ImZ.intercept — ImZ.intercept ■ Iml.slope 

31: nezvjm.slope — Iml.slope + ImZ.slope — Iml.slope ■ ImZ.slope 

32: return newjm 


first step, the clock drifts of processes with ranks smaller than 
the largest power of two ( 0 ,..., 2L*°S2 pJ — l) are estimated 
in function Sync_Clocks_POW 2 of Algorithm 3 . Then, in 
the second step, the remaining processes with larger ranks 
(> 2L*°S2 pJ) compute their linear models of the clock drift 
with respect to the already synchronized processes in one 
additional round (cf. Sync_Clocks_Remaining function). 

The major difference to the synchronization method 
found in Netgauge is the call to Learn_Model_HCA, 
which determines the model of the clock drift between 
two processes (Algorithm 4). The parameters N_EITPTS and 
N_EXCHANGES play the same role as in the algorithm of 
Jones and Koenig. Further, we use the same RTT estimation 
function as we did for the JK synchronization, which is 
presented Algorithm 17. 

In the case of Netgauge, intermediate clock offsets are 
summed up in a tree-like fashion to compute the offset of 
each process relative to the reference root node. To similarly 
build linear models of the clock drift, we needed to solve 
the problem of combining linear regression models. More 
formally, let us assume we have three processes located on 
different hosts called pi, p 2 , and pa, such that each process 
has its own clock. If the clocks of hosts pi and p 2 have an 
offset of diff^^ and the clocks of p 2 and pa have an offset of 
^^ffp 2 ps' clock offset between pi and pa can be computed 

=*^Pl.P2+*^P..P3■ 

Therefore, we apply a similar transitive computation to 
combine linear regression models to obtain the clock drift 
between different processes. If the clock drifts are computed 
in one round for process pairs (pi,P 2 ) and (p 2 ,Pa)r the 
question becomes: how should these two linear models be 


combined such that pa can obtain its clock drift with respect 
topi? 

Let us denote the model of the clock drift of p 2 relative 
to Pi as = ti — t 2 = Similarly, 

the clock drift of pa relative to p 2 is given as P^‘^{t 2 ) = 
G — G = s^^'^t 2 + The computation of the clock drift 
between pa and pi is shown in Equation 1 and implemented 
in Merge_LMS (cf. line 29 of Algorithm 4). 


/3—^1/"/ \ 3—^Ij. I 

t (ti) = S ti 


= ti — ta 

2-»l, I .2->l 

= S ti + I 

-E S t2 +t 

2-»l, , .2->l 

= S ti + I 


, 3^2,. 2->l, .2^1 n , .3 

-E S (tl-S tl-l ) -E Z 


= ti(s^ 


I O —A 

-\- s — s 


3^2 - 2^1 . - 3^2 

— 5 % 


2 \ , .2 
) + * 


( 1 ) 


To estimate the error of the computed linear model 
of the clock offset as a function of time, we conducted a 
statistical analysis for each pair of processes. We performed 
an experiment in which, for 15 pairs of processes running on 
different nodes, we measured 1000 fitpoints. We estimated 
the confidence intervals of both the slope and the intercept 
of each process pair. In the case of the slope, the length 
of the confidence interval is at most 2 x 10“®, whereas 
the intercept computation revealed much wider confidence 
intervals, in the order of 100 ms. The consequence of these 
larger intervals is that the intercept computed with a linear 
regression analysis will decrease the accuracy of the initial 
clock offset with a high probability. Thus, the global clock 
error will increase over time. 

To minimize the impact of the intercept error, we do not 
use the intercepts computed with a linear regression analysis. 
Instead, we have explored two approaches that appeared to 
be promising for computing the intercepts of the clock drift. 
Both approaches rely on SKaMPTs method for determining 
the clock offset between two processes at a given point in 
time (SKaMPI_PingPONG). The intercepts can be obtained 
by measuring the clock offset between two processes and 
then using the already computed slope to find the intercept of 
the linear clock model (Compute_And_Set_Intercept of 
Algorithm 4). The reason why we have selected the SKaMPI 
method for computing the offset is that it provided us with 
the lowest initial clock offset values, as it will be shown in 
the first experiment in Section 4.5. 

The Ji’rsf approach is to compute the intercepts in 0{p) 
rounds after completing the hierarchical computation of 
the clock models, which only requires O{logp) rounds. We 
employ SKaMPTs clock synchronization to measure the clock 
offset between the root and each of the other p — 1 processes as 
shown in function Compute_And_Set_All_Intercepts. 
The advantage is that the intercept is measured for each clock 
model separately. Thus, the intercept error only depends on 
the accuracy of a single SKaMPI synchronization and on the 
error of the slope, which was found to be very small (10“®). 

The second approach is to compute the intercepts during 
the hierarchical computation of the clock model in 0(\ogp) 
rounds. This algorithmic option is enabled by defining the 
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follows: 
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Figure 5. Adjusting the intercept of the linear model using SKaMPI clock 
synchronization method after learning the slope of the clock drift. 


global variable hierarchicaljntercepts. In this case, we measure 
the clock offset and compute the new intercept by adjusting 
the offset using the clock model. Then, the intercept obtained 
from the linear regression is replaced with this new intercept. 
Here, the SKaMPI method is used to measure the clock offset 
for a pair of processes in each round. In order to compute 
the clock model between each process and the root, the linear 
models are combined hierarchically using Equation (1). The 
advantage of this method compared to the first approach 
is its better scalability. The downside is that relying on a 
combined intercept for the linear model increases the error 
of the logical global clock. 

The intercept used in the linear models represents the 
clock offset at time zero. However, the clocks provided 
by the system (RDTSC, MPl_Wtiine) can start at arbitrary 
values. For this reason, we use a logical local clock that 
starts with value zero (by subtracting the initially found 
timestamp, cf. line 1 of Algorithm 3). Figure 5 depicts our 
method for obtaining accurate linear models of the drift. 
First, the fitpoints are measured using timestamps that are 
adjusted to the initial local time. Next, the slope and the 
(temporary) intercept are computed using these fitpoints. 
Then, we re-measure the clock offset at a given point in 
time (called synchronization point) by applying the SKaMPI 
approach. Finally, the adjusted (and therefore final) intercept 
is computed based on the slope and the measured clock 
offset. 

We would like to point out that HCA should be consid¬ 
ered as a general framework to synchronize clocks. In the 
present paper, we have used the method of Jones and Koenig 
to compute the clock drift model and SKaMPTs method 
to improve the accuracy of the model intercept. However, 
the concrete implementations of (1) how to obtain the linear 
model or (2) how to measure the clock offsets can be modified 
by substituting the functions Learn_MODEL_HCA and 
Compute_And_Set_Intercept, respectively. 

We now need to estimate the errors introduced by 
hierarchically combining linear models using HCA. Let 
be the confidence interval of the slope (of the 
clock drift) of process p 2 relative to pi, and the 

confidence interval of the corresponding intercept. The im¬ 
pact of merging two linear models (according to Equation 1) 
on the resulting slope and intercept can be computed as 


In the case of the first approach of our algorithm, we 
are only interested in the errors of merging slopes, as the 
computed (linear regression) intercepts are disregarded in 
favor of clock offsets measured with SKaMPI in a linear 
fashion. The confidence intervals (CIs) of the computed 
slopes in our experiments on machine TUWien were in 
the order of 10“®. The Cl of the resulting slope only 
depends on the initial slopes and increases linearly with 
the number of performed merge operations due to the 
negligible resulting slope products. Thus, the error of the 
slope grows logarithmically in the number of processes when 
using our hierarchical way of combining linear regression 
models. Consequently, the merging error will only reach the 
order of microseconds when the experiment is conducted 
on processes. The second approach of the HCA method 
relies on hierarchically combining both intercepts and slopes. 
In our case, the values of the computed slopes are several 
orders of magnitude smaller than those of the intercepts, as 
the clocks do not steeply drift apart. Thus, when combining 
intercepts according to Equation 2, the sum of the initial 
intercept values will have a major impact on the resulting 
intercept. The confidence interval of the final intercept will 
consequently increase linearly with the number of merge 
operations applied. 

4.5 Evaluation of Clock Synchronization Methods 

To compare the synchronization schemes of SKaMPI and 
Netgauge (NBCBench) to the competitors, we have extracted 
the relevant clock synchronization algorithms from their 
respective benchmarking framework. In particular, it means 
that we use a fixed window size and disable the dynamic 
adaptation implemented by SKaMPI and NBCBench. This 
allows for a fairer analysis of the clock drifts. Furthermore, 
we rely on scheme (MS4) of Table 3 for measuring run-times. 
The run-times are based on global times as described in 
Section 3.2.2. The tuple (N_EITPTS, N_EXCHANGES) needed 
for HCA and JK is specified in each figure. 

In the following experiments, we show results obtained 
for HCA when applying each of the two approaches we 
previously described. In the case of the first approach, we use 
a hierarchical way of computing the slopes and the linear 
way of obtaining the intercepts (Oflogp) -F 0{p) rounds), 
while the second approach computes both the slope and the 
intercept hierarchically in O{\ogp) rounds. The HCA method 
applying theJiVsf approach is simply denoted "HCA" in the 
legend of all the experiments, whereas the label "HCA2" is 
used for the second approach. 

In practice, the estimation of the drift slope using linear 
regression typically requires many more ping-pong messages 
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Figure 6. Clock offset directly after synchronizing the processes (10 calls to mpirun, MVAPICFI 2.1a, TUWien, Exp. details: Appendix C.5). 


than the offset computation with SKaMPI for a pair of 
processes. Thus, p — 1 SKaMPI rounds can be much shorter 
than O(logp) rounds of the hierarchical slope computation. 
In our experimental setting (e.g., number of processes), a 
simple analytic model revealed that the first approach of HCA 
does not incur a significant run-time overhead compared to 
the second approach, since the time for obtaining the values 
for the linear regression is dominating. As a consequence, 
we only show results for both implementations of the HCA 
method in this section and only apply the first approach in the 
remainder of the paper. 

In our first experiment, we apply each of the previously 
described synchronization methods to obtain a global clock 
for every process. Then, we measure the clock offset between 
the root process and each of the other processes directly after 
the synchronization phase has been completed. To that end, 
the root process exchanges a number of ping-pong messages 
with all other processes and estimates its clock offset relative 
to the global time computed on each process. 

Figure 6(a) presents the maximum clock offset measured 
between any process and the reference process directly after 
finishing the clock synchronization. We used one MPI process 
per compute node in this experiment. Let 
clock offset between process r and process root in ping-pong 
round j (in total mounds = 10). The maximum clock offset 
is computed as maxo<r<p(mino<j<„roH„ds(*j9?'_roof)) 
synchronization algorithm. Each experiment was repeated 
10 times (different calls to mpirun). The graph shows that the 
clock offset measured directly after synchronizing the clocks 
with SKaMPI or Netgauge is very small, i.e., we measured an 
offset of at most 0.2 ps for up to 8 different compute nodes. 
However, the method of Netgauge leads to significantly 
larger offsets when the number of processes (and therefore 
the number of synchronization rounds) increases. The JK 
synchronization method produces slightly larger clock offsets 
for a small number of processes (2-16) compared to SKaMPI 
and Netgauge, due to the inaccuracy of JK's approach for 
computing linear models. 

We also checked how the HCA algorithm compares to the 
other clock synchronization methods. Figure 6(a) shows that 
for up to 32 processes (1 process per node), the maximum 
clock offsets obtained with the first approach of HCA are 
similar to the offsets yielded by the SKaMPI method. In the 


particular case of 32 processes, the maximum clock offset is 
clustered around 0.25 ps, which represents an improvement 
over all other methods. The second approach of HCA leads 
to slightly larger offset values, while it still outperforms 
the method of Jones and Koenig, as well as the method of 
Netgauge for 32 processes. However, HCA-based clock off¬ 
sets show an increasing trend with the number of processes, 
which is a consequence of hierarchically combining linear 
models. 

The picture does not change for larger numbers of 
processes, as shown in Figure 6(b). Here, SKaMPI still syn¬ 
chronizes the distributed clocks with the highest precision, 
but the relative difference to JK is smaller. Netgauge, in 
contrast, will lead to the least synchronized clocks among its 
competitors for 256 or more processes on our machine, due 
to its hierarchical way of combining the computed offsets. 
Both approaches of the HCA method appear to be viable 
alternatives to SKaMPI, as they result in clock offsets in the 
same order of magnitude. 

However, it is important to recall that real clocks are 
drifting apart, as shown in Figure 1. To evaluate the syn¬ 
chronization methods in this scenario, we performed another 
experiment, in which we measured the clock offset over time 
(clock drift). The root process waits in a loop for a given 
amount of time (e.g., 1 s) and then measures its clock offset 
to all other processes. In this way, we can determine how 
much the logical global time is drifting on each process. 

Figure 7 presents the clock drift measured for 512 pro¬ 
cesses on 32 nodes of our 36 node cluster {TUWien). We 
see that the clock synchronization methods that account for 
the clock drift (JK and both HCA approaches) are superior 
to the ones that only compute the initial clock offset to a 
reference clock (Netgauge and SKaMPI). This experiment 
also highlights the disadvantage of hierarchically combining 
linear model intercepts in the case of HCA2, which shows 
larger clock offsets than HCA for 512 processes. 

While these results suggest that the method of Jones and 
Koenig leads to the most precise measurements for long 
execution times, its synchronization mechanism is slow, as it 
serializes the computation of linear models. We are therefore 
interested in understanding the trade-off between the most 
accurate clock offset that is obtainable and the time it takes 
to synchronize the processes. 
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Figure 7. Clock drift for 512 (32 x 16) processes after 0, 2, 4, ..20 s 
(distribution of maximum offsets over 10 calls to mpirun, MVAPICH 2.1a, 
TUWien, Exp. details: Appendix C.6) 


Figure 8. Median clock offset (after 5 s) vs. synchronization phase 
duration for 512 processes (32 x 16) (10 calls to mpirun, MVAPICH 2.1a, 
TUWien, Exp. details: Appendix C.8). 


Figure 8 shows the Pareto frontier of the clock offset 
versus the synchronization time, which visualizes the possi¬ 
ble configuration choices. We also added the mean time to 
complete a call to MPl_Barrier as a baseline. It provides 
an insight on the magnitude of process imbalance when 
synchronizing measurements through MPl_Barrier calls 
and a limit to the clock offset that is acceptable for window- 
based synchronization methods to prove useful in bench¬ 
marking contexts. The figure plots the clock offsets that were 
measured five seconds after completing each clock synchro¬ 
nization method. We see that the clock offsets obtained with 
Netgauge and SKaMPI are relatively large (« 80 ps), but both 
need less than one second to complete. In contrast, the time 
to complete the clock synchronization phases of JK and HCA 
depends on the number of sent ping-pong messages needed 
to compute the regression models. Thus, the parameters 
number of fit-points and number of exchanges have a strong 
influence on the quality of the clock synchronization. Figure 8 
indicates that both implementations of HCA are able to 
synchronize the clocks with a higher precision than what 
MPl_Barrier can provide, while only requiring at most 
10 s to finish the synchronization process. The method of 
Jones and Koenig, on the other hand, produces even smaller 
clock offsets, but requires at least 30 s (in the (100, 30) case) 
to complete. 

4.6 A Closer Look at MPi_Barrier 

So far, we have examined the accuracy of different clock 
synchronization methods for the window-based bench¬ 
mark scheme. It remains an open question how much the 
MPl_Barrier synchronization affects the results. 

The use of MPl_Barrier for process synchronization 
is portable but not necessarily reproducible or fair. Every 
MPI implementation might use a different algorithm to im¬ 
plement the barrier, with possibly different synchronization 
characteristics. The advantage of synchronizing processes 
with MPl_Barrier is that this method is independent of 
a logical global clock, and thus, subsequently measured 
run-times will not experience a drift. Further, processes 
will typically require a shorter waiting time compared to 
a window-based scheme, which makes the MPl_Barrier- 
benchmarks usually faster to complete a set of experiments. 
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Figure 9. Run-time of MPi_Aiireduce obtained when using window- 
based synchronization and MPi_Barrier-based synchronization and 
two different approaches for computing the run-time (32KiBytes, 
16 X 1 processes, 4000 runs, bin size: 100, MVAPICH 2.0a-qlc, VSC-3, 
Exp. details: Appendix C.9). 

However, we need to examine how well the synchronization 
using MPl_Barrier really works in practice. 

Typically, MPI benchmarks that use MPI_Barrier to 
synchronize processes between measurements define the 
run-time of an MPI function as the maximum local run¬ 
time measured on each process. The problem with this way 
of estimating the run-time is that it is assumed that all 
processes leave MPl_Barrier and enter the MPI call to 
be benchmarked almost synchronously. 

When we compared measurements obtained with 
window-based and MPl_Barrier-based schemes, we en¬ 
countered cases for which we initially had no explanation. 
The graph on the left-hand side of Figure 9 shows one 
of these experiments, where we compare the run-time of 
MPI_A11 reduce obtained with a window-based scheme 
(in which clocks were synchronized using HCA) to the 
run-time obtained when synchronizing with MPl_Barrier. 
The mean run-time of MPl_Allreduce when applying the 
MPl_Barrier synchronization was about 70 ps (computed 
by using local run-times), while the same call took approx¬ 
imately 100 ps using the window-based scheme. As this 
difference seemed very large, it needed further investigation. 

We repeated the experiment for MPl_Allreduce, but 
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Figure 10. Synchronization imbalance of MPi_Barrier implementations. 
Exit times of eaoh process relative to the first process that leaves the 
barrier (mean of 1000 measurements and 95% confidence intervals, 
16 X 1 processes, one mpirun, FICA synchronization, window size: 
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Figure 11. Distribution of normalized median run-times of MPi_Bcast 
for Intel MPI 5 and MVAPICFI 2.0a-qlc, when measurements are synchro¬ 
nized using a call to MPi_Barrier or an external barrier implementation 
(16 X 1 processes, 1000 measurements, 10 calls to mpirun, VSC-3, Exp. 
details: Appendix C.11). 


this time, while we still synchronized processes using 
MPl_Barrier, we measured global times on each pro¬ 
cess using our HCA method to normalize local times to 
the root's reference clock (cf. Section 3.2.2). The chart on 
the right-hand side of Figure 9 shows the resulting run¬ 
times of MP l_Allreduce for both synchronization methods 
(MPl_Barrier and window-based with tiCA), where all 
times were obtained using globally-synchronized clocks. 
Now, the resulting run-times are much closer and their 
difference can reasonably be explained by the way the two 
synchronization schemes work. 

Nevertheless, we still need to explain the gap between 
the observed run-times when we switch from local to global 
times to determine the overall run-time. Ideally, both run¬ 
time computation methods should lead to similar results. 
Therefore, we investigated the skew of MPI processes when 
they exit the MPl_Barrier function. For this purpose, 
we applied the TTCA method to synchronize clocks and 
recorded the global timestamp of each process at the end 
of the MPl_Barrier call. The results of this experiment are 
shown in Figure 10. The graphs compare the process skew 
after completing MPl_Barrier, measured with Intel MPI 5 
(left) and MVAPICH 2.0a-qlc (right). Surprisingly, a call to 
MPl_Barrier using MVAPICH 2.0a-qlc resulted in a large 
process skew. In particular, the mean exit times between 
process 0 and process 15 differed by more than 40 ps. This 
finding directly explains why the measurements in the 
previous experiment (cf. Figure 9) showed such a large 
difference in run-time. 

The experiments discussed here only indicate the po¬ 
tential impact of using MPl_Barrier to synchronize pro¬ 
cesses on MPI benchmarking results. They are not meant 
to point out potential performance problems of libraries 
such as MVAPICH. Therefore, we show results obtained 
with MVAPICH 2.0a-qlc, which is not the latest version of 
MVAPICH, but the one that was pre-installed on the system 
and for which we experienced this significant process skew. 

Last, we would like to demonstrate how misleading the 
run-time measurements can be when the experimenter reUes 
on MPl_Barrier for process synchronization. Figure 11 
compares the normalized run-times of MPl_Bcast obtained 
with either an external benchmark-provided dissemination 
barrier (cf. [20]) or the barrier implementation provided by 


each library. We have executed 10 distinct calls to mpirun, 
in each of which 1000 measurements were recorded. We 
compute the median of each sample and normalize the run¬ 
times for one message size to the median run-time of these 
10 medians of MVAPICH 2.0a-qlc. We observe, especially for 
the smaller message sizes (2® Bytes to 2^^ Bytes), that there 
is no clear winner between Intel MPI 5 and MVAPICH 2.0a- 
qlc when our own barrier implementation is used (left- 
hand side). However, when we employ the library-provided 
MPl_Barrier implementation for synchronizing processes, 
we see a significant performance difference between the 
libraries. In such cases, one could easily draw wrong conclu¬ 
sions. 

4.7 Summary of Distributed Ciock Synchronization 
Methods 

We have shown that the choice of a clock synchronization 
method used for MPI benchmarking has tremendous effects 
on the outcome. The clock synchronization method imple¬ 
mented by SKaMPI can achieve very accurate timings, but 
since the logical global clocks are drifting quickly, only a 
small number of MPI operations can be measured precisely, 
which of course depends on the length of each MPI function 
call. In case the experimenter wants to measure for a longer 
period of time (e.g., several milliseconds or even seconds), 
the approaches used in SKaMPI and Netgauge simply lead 
to inaccurate measurements. To overcome this problem, one 
could start re-synchronizing the clocks after a given amount 
of time has passed or use a clock synchronization algorithm 
that accounts for the clock drift. 

The approach of Netgauge is more scalable than the clock 
synchronization used in SKaMPI. However, it possesses an 
increased synchronization error since it combines estimated 
clock offsets, which themselves entail an error. Additionally, 
Netgauge uses a heuristic to estimate the CPU frequency, 
which is a potential source of error. 

The clock synchronization method of Jones and Koenig 
accurafely synchronizes a set of distributed clocks as it 
considers the clock drift between processes. Thus, it can be 
used for measuring MPI functions over a longer time span. 
This approach could be used if very accurate window-based 
measurements are required and if the relatively long time for 
completing the clock synchronization can be tolerated. 
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Our clock synchronization method, called HCA, can be 
seen as a trade-off between achieving a higher accuracy for 
longer runs like JK and providing the speed of Netgauge. It 
suffers from the same problem as Netgauge, as it combines 
models with an inherent experimental error. Nevertheless, in 
our MPl benchmarking setup the HCA algorithm emerged 
as the best option for process synchronization compared to 
MPl_Barrier, SKaMPl, or Netgauge when measurements 
over longer periods of time (we have tested for up to 20 s) 
for many processes are needed. 

Last, we note that using a library-provided implemen¬ 
tation of MPl_Barrier may lead to unforeseeable results, 
as processes can become significantly skewed when they 
leave MPl_Barrier. The decision whether to rely on 
MPl_Barrier should therefore be done after investigating 
the behavior of the implementation on the given network. 
Nevertheless, an MPl benchmark should provide its own 
MPI_Barrier implementation for fairly comparing two 
MPl libraries. 


5 Influencing Factors of MPl Benchmarks 

After examining the MPl benchmarking process, we now 
turn to characterizing and analyzing the performance data. A 
good understanding of the performance data is essential for 
selecting and applying the right statistical test for comparing 
MPl alternatives. But for a rigorous statistical analysis, we 
need a deeper insight into our system and the factors that 
influence the run-times to be measured. Le Boudec points out 
that "knowing all factors is a tedious, but necessary task. In 
particular, all factors should be incorporated, whether they 
interest you or not" [21]. 

Hence, we will first examine the shape of sampling 
distributions of run-time measurements. Then, we will ana¬ 
lyze potential experimental factors in the remainder of this 
section. However, we decided to exclude "obvious" factors 
of MPl performance experiments, such as the communication 
network, the number of processes, and the message size. 

5.1 Sampling Distributions of MPl Timings 

To apply a statistical hypothesis test, we need to make sure 
that all its assumptions are met. For example, many tests 
assume that the data foUow a specific probability distribution, 
e.g., the dataset is normally distributed. We now examine 
the experimentally obtained distributions of MPl function 
run-times. 

We first ran a large number of MPl experiments to 
investigate various sampling distribution of MPl timings. 
The experiments were conducted for several MPl functions 
such as MPI_Bcast, MPI_A11reduce, MPI_Alltoall, or 
MPl_Scan. Figure 12 shows the distribution of run-times for 
10 000 calls to MP l_Scan with a payload of 10 000 Bytes and 
to MPl_Allreduce with a payload of 1000Bytes, both for 
16 processes (one process per node). We used a kernel density 
estimator (density in R) to obtain a visual representation 
of the sampling distribution. The figure indicates that the 
sampling distributions are clearly not normal, and interest¬ 
ingly, in both distributions we can see two distinct peaks. 
The peak on the right is much smaller, but it appears in many 
of the histograms for small execution times (less than 200 ps). 
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Figure 12. Flistogram of the time needed to compiete a caii to MPi_scan 
with 10 000 Bytes (ieft) and to MPl_Allreduce with 1000 Bytes (right) on 
TUWien {MPi_Barrier synchronization, NECMPi 1.2.8, Exp. detaiis: 
Appendix C.14). 
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Figure 13. Distribution of sampie means and Q-Q piots when sam- 
piing using different sampie sizes from the probabiiity distribution for 
MPi_Aiireduce (cf. Figure 12). 


Similar distributions were obtained for experiments with 
MPl_Alltoall and MPl_Bcast, as weU as on other parallel 
machines (see Figures 41, 42, and 43 in Appendix D.6). 

Since the measured run-times do not follow a normal 
distribution, we must be careful when computing statistics 
such as the confidence interval for the mean. The central 
limit theorem (CLT) states that sample means are normally 
distributed if the sample size is large enough. In practice, we 
most often do not know in advance how large the sample 
size should be such that the CLT holds. Many textbooks, 
like the books by Lilja [22] or Ross [23], state that a sample 
size of 30 is large enough to obtain a normally distributed 
mean. However, Chen et al. [24] report in a recent study that 
samples need to include at least 240 observations, such that 
the sample means follow a normal distribution. 

We are interested in how many repetitions of a single 
measurement are needed within one call to mpirun such 
that the CLT holds for the computed sample mean. To answer 
this question, we analyzed distributions of sample means 
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by randomly sampling from the set of 10 000 previously 
measured MPI run-times of MPl_Allreduce (cf. Figure 12). 
In particular, we drew 500 samples containing 10, 20,..., 500 
observations each, computed the mean of each sample, and 
built a histogram of the sample means for each sample size. 
Figure 13 shows two histograms and their corresponding 
Q-Q plots for a sample size of 30 and 500. The data provides 
evidence that a sample size of 30 is not large enough to 
obtain a normally distributed sample mean. In our particular 
case, 500 observations were required so that the distribution 
of sample means was normally shaped. We therefore advise 
scientists to carefully verify the sample distributions in 
order to compute meaningful confidence intervals of the 
sample mean when benchmarking MPI functions. A similar 
suggestion has been made recently by Floefler and Belli [25]. 


5.2 Factor: The Influence of mpirun 

After taking a closer look at the results of the sampling 
experiment shown in Section 5.1, we noticed that the sample 
means were slightly different between calls to mpirun. To 
investigate the effect of mpirun, we conducted a series of 
experiments to determine whether distinct calls to mpirun 
produce different sample means (statistically significant).The 
experimental setup was the following: We executed 30 dis¬ 
tinct calls to mpirun and within each mpirun we mea¬ 
sured each individual run-time of 1000 calls to a given 
MPI function. Figure 14 presents a subset of the gathered 
experimental results. The graphs compare the means and 
their 95 % confidence intervals for 30 distinct calls to mpirun, 
a given MPI function, and a message size. 

The data yielded by this experiment provide convincing 
evidence that the run-time means obtained from distinct 
calls to mpirun are different. The differences between these 
means, however, are often not very large (3 %-5 %), yet they 
are statistically significant. 

Our finding that different calls to mpirun have a signif¬ 
icant effect on the experimental outcome is very important 
for designing MPI benchmarks. As a consequence, it is 
insufficient for an MPI benchmark to collect MPI run-time 
measurements only from a single call to mpirun. Instead, 
several calls to mpirun are required to correctly account 
for the variance between different calls. The problem of 
finding out how many calls to mpirun are needed is tightly 
connected to the statistical hypothesis test to be applied. We 
discuss this question in more detail in Section 6. 

Figure 15 (right) shows the distribution of 500 sample 
means of the run-time of MPI_A11 reduce obtained from 
500 distinct calls to mpirun. For every call to mpirun, 
we recorded 1000 run-time measurements and computed 
their mean. We can observe that the means are normally 
distributed. If the distributions obtained from different 
mpiruns are relatively similar, the normality distribution 
of the means is a consequence of the central limit theorem. 
However, we cannot formally assume that the computed 
means are normally distributed, as each mpirun could 
produce a completely different distribution of run-times. 
In such cases, we need to check for normality either using 
the Kolmogorov-Smirnov or the Shapiro-Wilk test [26]. 


5.3 Factor: Uncontrollable System Noise 

Several run-time distributions shown so far exhibited a 
longer tail or a second smaller peak on the right. Thus, 
it might be possible that subsequent measurements of MPI 
functions are similar. Then the question becomes whether 
different measurements, taken in sequence, are independent 
from each other. This verification of the measurement inde¬ 
pendence is essential, as virtually all statistical hypothesis 
tests assume that random variables are independent and 
identically distributed (iid). If this assumption is violated, 
statistical measures could be misleading, e.g., the computed 
confidence interval of the mean could be too small [21, p. 47]. 

LeBoudec suggests to evaluate the autocorrelation of 
the experimental data [21]. Consequently, we computed the 
autocorrelation of our experiments and show some of the 
results in Figure 16. Autocorrelation is typically used in 
time-series analysis, and it estimates the correlation between 
two values of the same variable measured at different times 
as a function of the time lag between them. In particular, 
the autocorrelation coefficient^ at lag h is computed as 
the ratio of the autocovariance Ch to the variance Cq. A 
significant correlation of measurements in the data can be 
seen when a line in the lag plot at a specific lag value exceeds 
the significance level. If all values were chosen randomly, 
for example from a normal distribution, then individual 
measurements are uncorrelated. Figure 16(a) shows that the 
experimental data exhibit a significant correlation between 
measurements. An immediate consequence is that not all 
assumptions for applying hypothesis tests hold true as 
measurements are correlated. 

One way to remove the correlation is the use of data 
sub-sampling, as stated by LeBoudec [21]. Indeed, when we 
sub-sample 1000 observations from the original sample of 
10 000 observations, the run-times become uncorrelated as 
shown in Figure 16(b). When we compare both histograms 
presented in Figure 16, we can observe that data sub¬ 
sampling has almost no effect on the sample mean. Hence, 
we do not apply a sub-sampling strategy in the remainder of 
the article, but we need to keep in mind that measurements 
are potentially dependent. 

5.4 Factor: Synchronization Method 

After introducing and discussing several clock synchroniza¬ 
tion methods in Section 4, we now want to evaluate their 
effect on MPI benchmarking results. 

We start by looking at the evolution of run-time measure¬ 
ments over a longer period of time in Figure 17. This graph 
compares the run-times of MPl_Allreduce measured using 
the synchronization method of Jones and Koenig with the 
ones obtained using an MPl_Barrier. The plot exposes two 
critical issues when measuring and analyzing MPI perfor¬ 
mance data. First, we see a significant difference between 
the mean and median run-times, which we computed for 
each bin of 10 000 runs. The difference is also present when 
outliers are removed (using Tukey's method, cf. Section 3.5). 
Second, the use of a window-based synchronization method 
might allow the experimenter to obtain more accurate results. 

3. http://www.itl.nist.gov/div898/handbook/eda/section3/ 
autocopl.htm 



16 


36- 
■ M ■ 35 - 
g34H 


16x1 procs, MPLBcast, 8192 Bytes 






§ 33^1 i 

^ * i. 




32- 




10 20 
distinct mpirun 


30 


(a) TUWien, NECMPI 1.2.8 


32x1 nodes, MPLBcast, 2048 Bytes 

220 


^210 

<p 

B 

’■+J 

P 200 


(b) VSC-1, Intel MPI 4.1 




* i i * i 

■ » * M 

• • 


0 10 20 30 

distinct mpirun 


16x1 procs, MPLBcast, 8192 Bytes 



Figure 14. Mean and 95% confidence interval of the time to complete a call to MPi_Bcast (with a different size of the payload) for 30 distinct calls to 
mpirun, using MPi_Barrier for synchronization (Exp. details: Appendix C.12). 
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Figure 15. Distribution of mean run-time values of MPi_Aiireduce 
and respective density estimation (16 x 1 processes, 1000 measure¬ 
ments, 500 calls to mpirun, JK synchronization, window size 1ms, 
MVAPICH 2.1a, TUWien, Exp. details: Appendix C.13). 


But even with a very precise clock synchronization method, 
such as the algorithm of Jones and Koenig, the run-times will 
gradually drift apart over time. 

Now, we would like to know how the different clock 
synchronization algorithms compare to each other in terms 
of clock drift. Figure 18 compares the resulting run-times for 
MPI_A11 reduce with a message size of 8192 Bytes obtained 
with different synchronization methods. The run-times are 
computed as medians of bins over 10 experiments, where 
for each experiment we binned every 100 measurements and 
computed their means. We can observe that a window-based 
approach might improve the accuracy of the execution time 
of MPI functions. For example, in Figure 18, the run-times 
measured with Netgauge, JK, or HCA are smaller than the 
ones measured with MPl_Barrier. However, as explained 
before, the run-times obtained with SKaMPl and Netgauge 
increase in time. 

From the experiments shown above, we also see that 
the differences between the run-times measured with either 
a window-based or an MPl_Barrier-based scheme are 
relatively small. For that reason, the practitioner may ask two 
questions: (1) Is MPl_Barrier good enough to reasonably 
compare MPI measurements? and (2) How large should the 


window size be to get accurate measurements for window- 
based synchronization schemes? 

In our opinion, question (1) cannot be answered generally 
as it depends on the actual goal of an experiment and the 
implementation of MPl_Barrier. If the experimenter seeks 
to obtain the most accurate timings for short-running MPI 
functions, the use of a window-based scheme is recom¬ 
mended. For a fair comparison of MPI implementations, 
relying on MPl_Barrier may be completely sufficient if the 
same MPl_Barrier algorithm is used by all of them. 

To investigate how large the window size should be 
in order to achieve a good trade-off between the number 
of correct measurements and the duration of the entire 
experiment, we varied the window size and recorded the 
number of out-of-sync measurements. The implementations 
of the window-based schemes found in Netgauge and 
SKaMPl increase the window size when the number of 
incorrect measurements exceeds some threshold. However, 
in the experiment shown in Figure 19, we keep the window 
size constant for comparison reasons. We can see that the 
percentage of measurements that need to be discarded is 
similar for all synchronization methods, when we measure 
the run-time of the MPl_Alltoall function 1000 times. It 
was also expected that this percentage decreases when the 
window size is increased as shown in the figure. 

Now, one open question remains: How large should the 
window (size) be? On the one hand, the larger the window 
size that we select, the more time will elapse, which will 
result in a larger clock drift. On the other hand, if the window 
size is too small, we will have to dismiss many measurements. 
For this reason. Figure 20 compares the mean run-times 
measured with increasing window sizes for different clock 
synchronization methods. The figure shows that the run¬ 
time of MPl_Scan obtained using the HCA synchronization 
method stays relatively stable regardless of the window 
size. This is not the case for the clock synchronization 
methods used in Netgauge or SKaMPl; here, the run-times 
increase when the window size grows. This behavior is 
again a consequence of ignoring the clock drift in their clock 
synchronization methods. 
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Figure 16. Distribution of run-times for MPi_Bcast and the corresponding autocorrelation plot (1000Bytes, 16 x 1 processes, 1000 runs, 
MPi_Barrier synchronization, NECMPI 1.2.8, TUWien, Exp. details: Appendix C.14). 
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Figure 17. Mean and median of run-times of MPi_Aiireduce when 
synchronizing either with the method of Jones and Koenig (window size: 
1ms) orMPl_Barrier (1000 Bytes, 16 X 1 processes, 500 000 runs, bin 
size: 10000, MVAPICFI 2.1a, TUWien, Exp. details: Appendix C.15). 
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Figure 19. Percentage of incorrect measurements for MPi_Aiitoaii 
(8192Bytes, 16 x 1 processes, 1000 measurements, 10 calls to mpirun, 
FICA synchronization, MVAPICFI 2.1a, TUWien, Exp. details: Ap¬ 
pendix C.16). 



Figure 20. Impact of the window size on the run-time of MPi_scan (me¬ 
dian run-time distribution of 10 calls to mpirun, 8192 Bytes, 16 x 1 pro¬ 
cesses, 1000 measurements, HCA synchronization, MVAPICFI 2.1a, 
TUWien, Exp. details: Appendix C.16). 


Figure 18. Drifting run-times of MPi_Aiireduce (median of 10 experi¬ 
ments, 8192 Bytes, 32 x 16 processes, 4000 runs, bin size 100, window 
size: 500ps, MVAPICFI 2.1a, TUWien, Exp. details: Appendix C.17). 

5.5 Factor: Pinning MPI Processes 

It is well-known that the performance of MPI applications 
might be sensitive to the way processes are pinned to 
CPUs, as pinning can influence several performance-relevant 
properties, such as the cache reuse or the applicability of 
intra-node communication. 

In the context of MPI benchmarking, CPU pinning is 
certainly required if we want to use the RDTSC instruction to 


measure the run-time, since unpinned processes might result 
in erroneous results (cf. Section 3.4). Yet, the more general 
question is: Does pinning affect the execution time of MPI 
functions? 

Figure 21 shows the results of an experiment in which 
we investigate whether the run-time of an MPI function 
changes if processes are pinned to CPUs or not (using 
MPl_Wtiine for time measurements). The figure presents the 
histograms of run-times for MPl_Allreduce and various 
message sizes. Each histogram is generated by accumulating 
all run-time measurements from 10 different calls to mpirun. 
We can clearly see a significant difference in the shape of 
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Figure 21. Pinning effect on the run-time of MPi_Aiireduce (256 pro¬ 
cesses (16 X 16), 1000 measurements, 10 calls to mpirun, MPi_wtime, 
HCA synchronization, window size: 1ms, NECMPI 1.2.11, TUWien (Exp. 
details: Appendix C.18). 
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Figure 22. Compiler effect on the run-time of MPi_Aiireduce (median 
run-time distribution of 30 calls to mpirun, 16 x 1 processes, 1000 mea¬ 
surements, FICA synchronization, window size: 1ms, MVAPICFI 2.1a, 
TUWien, Exp. details: Appendix C.19). 


the histograms and between the mean run-times, which are 
marked with a vertical line. Even though there could be 
cases where pinning has no effect on measurements, we 
have shown that pinning is an experimental factor to be 
considered in the context of MPl benchmarking. 

5.6 Factor: Compiler and Compiler Flags 

It seems self-evident to consider the compiler and the 
compiler flags as being significant experimental factors of 
MPl benchmarking applications. We still need to measure 
this effect to support our conjecture. 

We conducted an experiment in which we measured 
the run-time of a call to MPl_Allreduce with the same 
version of MVAPICH. We recompiled the entire library 
(MVAPICH 2.1a) with gcc 4.4.7, but for each experimental 
run we changed the optimization flag to either -01, -02, 
or -03. Figure 22 clearly shows that compiling the library 
using -03 outperforms the versions with other optimization 
flags. Even though it seems obvious, our message is this: if 
an MPl benchmarking experiment does not clearly state the 
compiler and the compilation flags used, the results will not 
be comparable or might not even be trustworthy. 
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Figure 23. DVFS configuration effect on MPi_Aiireduce run-times 
(median run-time distribution of 30 calls to mpirun, 16 x 1 processes, 
1000 measurements, FICA synchronization, window size: 1ms, MVA¬ 
PICH 2.1 a vs. NEC MPl 1.2.11, TUWien, Exp. details: Appendix C.20). 

5.7 Factor: DVFS 

The majority of today's processors offer dynamic voltage and 
frequency scaling (DVFS) capabilities to reduce the energy 
consumption of the chip. Changing the core frequency is 
therefore an obvious factor for computationally-intensive 
workloads. In this work, we investigate whether the choice 
of the DVFS level may alter the run-times of MPl operations. 

We conducted an experiment on TUWien, in which 
we compared the run-times of MPI_Allreduce for 
two different MPl implementations, MVAPICH 2.1a and 
NECMPI 1.2.11, and for two different DVFS levels, 2.3GHz 
and 0.8 GHz. Figure 23 presents the results of this experiment. 
The upper graph shows that MVAPICH outperforms NEC 
MPl for message sizes of up to Bytes when all processors 
are running at a fixed frequency of 2.3 GHz. In contrast, 
when we change the frequency to 0.8 GHz for all the 
processors, NEC MPl dominates MVAPICH for all message 
sizes. Additionally, we see that the individual run-times of 
MPI_A11 reduce increase significantly when reducing the 
cores' frequencies. 

The key observation is that the DVFS level needs to be 
carefully stated. Two MPl implementations may compare 
and behave differently depending on the chosen DVFS level. 

5.8 Factor: Warm vs. Cold Cache 

Gropp and Lusk [5] had already named the problem of 
"ignor[ing] cache effects" among the perils of performance 
measurements. They pointed out that the time to complete a 
send or receive operation depends on whether the send and 
receive buffers are in the caches or not. Therefore, mpptest 
uses larger arrays for sending and receiving messages, but 
the offset from where messages are sent or received is 
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Figure 24. Cold cache vs. warm cache: median run-times for 
MPi_Aiireduce (median run-time distribution of 10 calls to mpirun, 
16 X 1 processes, 1000 measurements, HCA synchronization, window 
size: 1ms, MVAPICFI 2.1a, TUWien, Exp. details: Appendix C.21). 

changed in a block-cyclic way at every iteration, to reduce 
the chance that data resides in cache. 

The influence of caching was shown by Gropp and Lusk 
using mpptest for measuring the run-time of point-to-point 
communication. In the present work, we investigate how 
large the effect of caching is on blocking collective MPI 
operations. Instead of using buffer-cycling, we implemented 
another approach: we assume that the size of the last level 
of data cache, which is private to each core, is known. On 
current hardware this is often cache level 2. Let the size 
of this data cache be Bytes. We allocate an auxiliary 
buffer containing Bytes. Now, we alter our MPI 
benchmark as follows: we overwrite the entire buffer 
(using memset) after each iteration, i.e., when one measure¬ 
ment of a collective MPI call has been completed. This way 
we attempt (since we do not know the hardware details) to 
ensure that our message buffer used for the MPI operation is 
not cached. 

The effect of caching is shown in Figure 24, in which we 
can see that the reuse of message buffers between subsequent 
MPI calls, in this case MPI_A11 reduce, has a significant 
impact on the run-time. As a result, MPI benchmarks must 
clearly state whether and how the caching of messages 
(buffers) is controlled. 

5.9 Summarizing Experimentai Factors 

Our initial goal was to allow a fair and reproducible com¬ 
parison of the performance of MPI implementations. A well- 
defined experimental design is one requirement to achieve 
that goal, and therefore, the analysis of experimental factors 
is of major importance. We have analyzed factors, such as 
compiler flags or cache control, and evaluated whether they 
have a significant effect on the experimental outcome. The 
influence of some factors on the performance measurements 
was not surprising, for example, we had expected that the 
DVFS level would affect the run-times. 

However, the experiments led to two main results: The 
first lesson we learned was that the execution time of MPI 
benchmarks varies significantly between calls to mpirun. 
As a consequence, a reproducible and fair comparison of 
run-time measurements requires that performance data are 


recorded from different calls to mpirun. The second lesson, 
that we found quite surprising, was that determining which 
MPI implementation is better for a given case depends on 
the configuration of the experimental factors. For example, 
the run-time of MPl_Bcast might be shorter with library A 
using DVFS level "low", but library B will provide a faster 
MPl_Bcast implementation in DVFS level "high". 

Of course, our list of examined experimental factors is 
not exhaustive, and we are aware that other factors could 
also impact the experimental outcome. One such example is 
the operating system. Since many of such factors are often 
uncontrollable, we need to address them statistically. 

In conclusion, we advise MPI experimenters to carefully 
list the settings of all experimental factors, besides the 
obvious factors such as number of processes, message size, 
and parallel machine. Table 6 is our proposal of a list of 
experimental factors that, we believe, should be attached to 
all MPI benchmark data. 


6 Statistically Rigorous and Reproducible 
MPI Benchmarking 

After investigating the factors that may influence results of 
MPI benchmarks, we now propose a method to compare MPI 
implementations by using statistical hypothesis testing. Our 
goal is to establish an experimental methodology that aims 
to reproduce the test outcome between several experiments. 

We motivate the need for a more robust evaluation 
method by showing the results in Figure 25. On the left- 
hand side, we see a comparison between the run-time of 
MVAPICH and NEC MPI when executing MPl_Allreduce 
with various message sizes. Each bar represents the mean run¬ 
time computed for 1000 individual measurements of a single 
call to mpirun. One might say that such a comparison is fair 
(due to the large number of repetitions) and we contend this 
is common practice when analyzing experimental results in 
the context of MPI benchmarking. However, when we look 
at the results shown on the right-hand side, the outcome 
changes significantly. For example, the ratio of mean run¬ 
times for a message size of 2^ Bytes has now changed. This 
observation matches the result of our factor analysis, in 
which we have discovered that the call to mpirun is an 
experimental factor (cf. Section 5.2). Therefore, we emphasize 
again that an MPI benchmark needs to collect data from 
multiple mpirun calls to be fair and reproducible. 
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Figure 25. Comparison of mean run-times of MPi_Aiireduce and different message sizes for two distinct caiis to mpirun (16 x 1 processes, 1000 
measurements per message size, HCA synchronization with window sizes adapted to each message size, MVAPiCH 2.1a vs. NEC MPi 1.2.11, 
TUWien, Exp. detaiis: Appendix C.22). 


6.1 Design of Reproducible Experiments 

Our new experimental design for measuring MPI per¬ 
formance data is shown in Algorithm 5. The procedure 
Benchmark generates the experimental layout and has five 
parameters, two of them being important for the statistical 
analysis: (1) n denotes the number of distinct calls to mpirun 
for each message size, and (2) nrep specifies the number 
of measurements taken for each message size in each call 
to mpirun. In total, we measure the execution time of 
a specific MPI function for every message size n ■ nrep 
times. In the procedure BENCHMARK in Algorithm 5, we 
issue n calls to mpirun, where the number of processes p 
stays fixed. To respect the principles of experimental design 
(randomization, replication, blocking) as stated by Mont¬ 
gomery [27], we randomize the experiment by shuffling the 
order of experiments within a call to mpirun. The procedure 
SCAN_Over_MPI_FuncTIONS has three parameters: the 
list of message sizes the list of MPI functions to be 

tested and the number of observations (nrep) to be 

recorded for each message size. The procedure creates a 
list (/“P) containing the experiments covering all message 
sizes and MPI functions. The order of elements in this list is 
shuffled before each item (experiment) is executed. 

The procedure BENCHMARK of Algorithm 5 is executed 
for each MPI implementation. After the measurement results 
have been gathered, we apply the data-analysis procedure 
shown in Algorithm 6. Here, we group run-time measure¬ 
ments by the message size, the type of MPI function, and the 
number of processes. We remove statistical outUers from each 
of these measurement groups. Last, we compute averages 
(the median and the mean) for each group of measurements 
and store them in a table. By applying this data-analysis 
method, we obtain a distribution of averages (medians or 


Algorithm 5 Design of MPI experiment. 

1: procedure BENCHMARK(p, n, nrep) 

//p - nb. of processes 
// n - nb. of mpi runs 
H jmsizc _ of message sizes 
// - list of MPI functions 

// nrep - nb. of measurements per run 
2: for i in 1 to n do 

3: mpirun -np p SCAN_OvER_MPI_FUNCTIONS(/"““', nrep) 

4: procedure SCAN_OvER_MPI_FUNCTIONS(i'““', nrep) 

5: r’’ r- 0 

6: for all msize in do 

7: forall/wncin//"”" do 

8: /“^.add( TlME_MPl_FUNCTlON(/T'mc, msize, nrep )) 

9: shuffle(/^^^) // create random permutation of calls in place 

10: for all exp in do 

11: call exp 


Algorithm 6 Analysis of benchmark data. 

1: procedure ANALYZE_RESULTS(r“'“, P, if"", n) 

U pnsize _ of message sizes 
// P - list of processes 
// - list of MPI functions 

// n - nb. of mpi runs 

2: for all msize C p C F,/zmc C do 

3: for i in 1 to n do 

4: [msize][p]\fiinc][i][j] for all 1 < j < nrep} 

5: — remove_outliers(Z^) 

6: v[msize][p]\func][i] — (median(/^),mean(/^^)) 

7: print v II fable with results 


means) over n calls to mpirun for each message size, MPI 
function, and number of processes. 

6.2 Fair Performance Comparison Through Statistical 
Data Analysis 

Now, the situation is as follows: we run our benchmark 
on two MPI implementations A and B, perform the data 
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Figure 26. Comparison of the run-time distributions obtained when measuring MPi_Aiireduce with a sample size of 10 (left) and 100 (right) per 
message size. The statistical significance was computed using the Wilcoxon test (16 x 1 processes, 30 calls to mpirun, FICA synchronization 
with window sizes adapted to each message size, MVAPICFI 2.1a vs. NEC MPI 1.2.11, TUWien, Exp. details: Appendix C.22). 


analysis according to Section 6.1, which gives us a distribu¬ 
tion of averages for each measurement point. The question 
then becomes: how can we compare the measured results 
in a statistically sound way? We could reduce all the values 
from the distribution to a single value using the minimum, 
the maximum, or the average, and then compare two MPI 
implementations based on this single value. However, our 
goal is to provide evidence that a measured performance 
difference has a high probability of being reproducible and 
that it is not merely a result of chance. 

Since we have two averages (the mean and the median) 
for each measurement group, we have several options for 
selecting a statistical test. If we use the computed median 
values as basis for our hypothesis test, we could employ 
the nonparametric Wilcoxon-Mann-Whitney test (Wilcoxon 
sum-of-ranks, in the remainder: WILCOXON TEST) for com¬ 
paring alternatives [28]. The advantage of the WILCOXON 
TEST (besides being nonparametric) is that it makes no 
assumption on the underlying distribution; in particular, 
it "does not require the assumption of normality" [23]. We 
could also employ the WILCOXON TEST on the distribution 
of means, but in this case the T-TEST for two independent 
samples is also a promising candidate. The T-TEST assumes 
that the underlying population is normally distributed and 
that the variances of both populations are equal [26]. We 
first have to make sure that our sample means computed 
for each mpirun are normally distributed. If the underly¬ 
ing distributions obtained from each mpirun are similarly 
shaped, then it is possible that also their means are normally 
distributed. For example, the Q-Q plot of the mean run-times 
(Figure 27) suggests that the distribution of means, which was 
presented in Figure 15 of Section 5.2, is normally shaped. In 
addition, the KoLmogorov-Smirnov and the Shapiro-Wilk test 
do not reject the null hypothesis, such that we can assume 
normality for the distribution of means. However, if the 
means follow a normal distribution, we also need to verify 
that the variances are equal. If the homogeneity of variance 
assumption is violated, several adaptations to the T-TEST 
have been proposed (cf. [26, p. 458], [17]). One adaptation is 
the so-called Welch's t-test that can be applied when two 


Normal Q-Q Plot 



Figure 27. Q-Q plot of mean run-times obtained from different calls to 
mpirun (cf. Figure 15). 

samples have unequal variances. 

hr the remainder of our analysis, we use the WILCOXON 
TEST exclusively. The reason is that the rigorous verification 
of the distributions of means (over mpiruns) showed that the 
mean run-times (obtained from mpiruns) are often normally 
distributed, but unfortunately not aU of them. Applying a T- 
TEST in cases in which the means are not normally distributed 
will not give us the desired statistical confidence, and the 
test results would be misleading (since the assumptions of 
the test are violated). 

We now demonstrate how to apply the WILCOXON TEST 
to our data and discuss why the test helps us to provide a fair 
comparison of MPI implementations. Figure 26 shows our 
statistical comparison method applied to run-times measured 
for MPl_Allreduce with both NEC MPI and MVAPICH. 
Let us focus first on the graph on the left of this figure, 
where we compare the distributions of means recorded 
for different message sizes. Each distribution contains 30 
elements, which are the median run-times measured in each 
of the 30 calls to mpirun. We apply the WILCOXON TEST 
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Figure 28. Comparison of the run-times of MPi_Aiireduce while 
applying the Wilcoxon test with “less” as alternative hypothesis (16 x 1 
processes, 30 calls to mpirun, FICA synchronization with window sizes 
adapted to each message size, MVAPICFI 2.1a vs. NEC MPI 1.2.11, 
TUWien, Exp. details: Appendix C.22). 


on the two distributions of medians for each message size. 
The test does not only report whether the null hypothesis 
(both population means are equal) is rejected or not, but it 
also provides a p-value. To obtain a graphical representation 
of the p-value and therefore the statistical significance, we 
represent the p-value by a sequence of asterisks. One asterisk 
(*) represents a p-value of p < 0.05, two asterisks denote 
p < 0.01, and three asterisks denote p < 0.001. It also 
means that if asterisks are absent in a specific case, the null 
hypothesis could not be rejected, and thus, the statistical test 
does not provide sufficient evidence which implementation 
is better. We used a significance level of 0.05 (5%) for all 
experiments. 

When we look at the left graph of Figure 26, in which we 
applied the WILCOXON test, we see that using a hypothesis 
test can indeed help to separate cases, for which a decision 
can hardly be made only by looking at the distributions. For 
example, the differences between the distributions for 2® 
and 2® Bytes seem to be negligible. However, the WILCOXON 
TEST reveals that there is evidence that the sample medians 
are different in the case of 2® Bytes, but not in the case of 
2® Bytes. 

The graph on the right of Figure 26 presents the results 
when applying the WILCOXON TEST with a sample size of 
100 per mpirun. It is not surprising that the variances of 
the distributions of the averages decrease, and thus, a larger 
sample size helps the hypothesis test to separate averages 
with a higher significance. 

The graphs in Figure 26 compare the run-time distribu¬ 
tions of two MPI implementations and show the statistical 
results when testing whether the population averages are 
equal. Yet, in a practical scenario one might rather ask a 
question like: Is MPI library X faster than library Y for 
MPI function F7 To answer such a question, we change the 
alternative hypothesis of the test to "less" (null hypothesis: 
Hq : PA = Pb, alternative hypothesis: Ha : pa < Pb, 
where p denotes the average). Figure 28 presents the results 
of the same experiments as shown in the bottom right comer 


of Figure 26, but now we check whether the run-time of 
MPI_A11 reduce is smaller with MVAPICH than with NEC 
MPI. 

We see that for the two cases 2^^ and 2^^ Bytes the null 
hypothesis could not be rejected, and thus, in these cases 
the run-time of MPl_Allreduce using MVAPICH is not 
smaller than when using NEC MPI. We note that this result 
does not immediately imply that NEC MPI is faster than 
MVAPICH in these cases. To verify this, the test should use 
the alternative hypothesis "greater". 


6.3 Evaluating the Outcome Reproducibility 

Until now, we have investigated the factors that potentially 
influence fhe benchmarking of MPI functions and have 
shown how statistical hypothesis tests help us to fairly 
compare the performance of two MPI libraries. One of our 
initial goals was to develop a benchmarking method that 
leads to a reproducible experimental outcome (see Table 1). 

To examine the reproducibility of our benchmarking 
method, we conducted the following experiment: We ran our 
benchmarking method (cf. Algorithm 5) for ntrial=30 times. 
Each of the ntrial runs gave us one distribution of run-times 
per message size, which contains n=30 values. Since we 
obtain a distribution of distributions, we collapse the inner 
distribution into a single value. To do so, we compute the 
mean of the n=30 values measured for one message size 
in each of the ntrial distributions. Then, we normalize the 
run-time values by computing the ratio of each mean to 
the minimum mean. We obtain a distribution of ntrial=30 
normalized run-time values for our benchmarking method, 
presented in Figure 29(c). We can observe that the maximum 
relative difference between the 30 runs is very small (less 
than 5 % for 2^"^ Bytes). 

As a comparison, we also conducted ntrial=30 runs of the 
Intel MPI Benchmarks 4.0.2 and SKaMPl 5. We used the stan¬ 
dard configuration of the two benchmark suites (in particular, 
we used the default values of the number of repetitions for 
each message size). We compute the normalized run-time of 
each measurement for a specific message size as follows: 
tm°sfz7!i = tmsize,ilt^size' for alH, 1 < f < ntrM = 30, where 
Csize = mini<i<„t™; (fmsize.i)- We can see in Figure 29(a) 
and Figure 29(b) that the normalized run-times of Intel 
MPI Benchmarks and SKaMPl exhibit a significantly larger 
variance for smaller message sizes than our benchmarking 
approach. The higher variance can be explained by the 
influence of system noise on experiments with small message 
sizes. In such cases, an MPI benchmark needs to record 
a sufficiently large number of repetitions across several 
calls to mpirun. Unfortunately, the Intel MPI Benchmarks 
and SKaMPl simply do not implement such reproducibility 
policies. 

Overall, we can state that our benchmarking approach 
notably improves the reproducibility of the performance 
results compared to the Intel MPI Benchmarks and SKaMPl. 
The price for a better reproducibility, however, is a longer 
run-time of the overall benchmark, caused by the need to 
take into account the clock drift between processes and to 
record a larger number of measurements per message size. 
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Figure 29. Distribution of normalized run-times reported by the Intel MPI Benchmarks (left), SKaMPI (center), and our method (right) when testing the 
performance of MPi_Bcast for various message sizes (16 x 1 processes, FICA synchronization with window sizes adapted to each message size 
for Figure 29(c), MVAPICFI 2.1a, TUWien, Exp. details: Appendix C.23). 


7 Related Work 

The statistically rigorous analysis of experimental data has 
been the focus of numerous studies over the last years, driven 
by the need for establishing a fair comparison of algorithms 
across different computing systems. 

Vitek and Kalibera contend that "[i]mportant results 
in systems research should be repeatable, they should be 
reproduced, and their evaluation should be carried with 
adequate rigor". They show that a correct experimental 
design paired with the right statistical tests is the cornerstone 
for reproducible experimental results [29]. The authors stress 
the fact that knowing and understanding the controllable 
and uncontrollable factors of the experiment is crucial for 
obtaining sound experimental results. 

The state of performance evaluation in Java benchmark¬ 
ing was investigated by Georges et al. [30]. They examined 
the performance of different garbage collectors for the Java 
Virtual Machine (JVM). The paper demonstrates that the 
answer to the question of which garbage collector is faster 
changes completely depending on the performance values 
investigated (e.g., mean, median, fastest, etc.). The authors 
show how to conduct a statistically rigorous analysis of JVM 
micro-benchmarks. In particular, they explain the need for 
considering confidence intervals of the mean and show that 
the Analysis of Variance (ANOVA) can be used to compare 
more than two alternatives in a sound manner. 

Mytkowicz et al. dedicated an entire article to the 
problem of measurement bias in micro-benchmarks [31]. 
The authors examine the run-time measurements of several 
SPEC CPU2006 benchmarks, when each benchmark is either 
compiled with the optimization flag -02 or -03. In theory, 
the programs compiled with -03 should run faster than the 
ones compiled with -02. However, the authors discovered 
that the resulting performance not only depends on obvious 
factors such as the compilation flags or the input size, but 
also on less obvious factors, such as the link order of object 
files or the size of the UNIX environment. A possible solution 
is to apply a randomized experimental setup. Please refer to 
the books of Box et al. [32] and Montgomery [27] for more 
details on randomizing experiments. 

Touati et al. developed a statistical protocol called 
Speedup-Test that can be used to determine whether the 
speedup obtained when modifying an experimental factor, 
such as the compilation flag (-03), is significant [33]. The 
article presents two tests, one to compare the mean and 


one to compare the median execution times of two sets of 
observations. For a statistically sound analysis, they base 
both Speedup-Test protocols on well-known tests, such as 
the Student's T-TEST to compare means or the Kolmogorov- 
Smirnov test to check whether two samples have a common 
underlying distribution. 

Chen et al. proposed the Hierarchical Performance Testing 
(HPT) framework to compare the performance of computer 
systems using a set of benchmarks [24]. The authors first 
contend that it is generally unknown how large the sample 
size needs to be, such that the central limit theorem holds. 
They show that for some distributions a sample size "[i]n 
the order of 160 to 240" is required to apply statistical 
tests that require normally distributed data [24]. Since such 
a high number of experiments seems infeasible for them, 
they propose a nonparametric framework to compare the 
performance improvement of computer systems. The HPT 
framework employs the nonparametric Wilcoxon Rank-Sum 
Test to compare the performance score of a single benchmark 
and the Wilcoxon Signed-Rank Test to compare the scores 
over all benchmarks. 

Gil et al. presented a study on micro-benchmarking on the 
JVM, in which they show that the mean execution times over 
several JVM invocations may significantly differ [34]. The 
described effect is very similar to the work presented here, 
as our micro-benchmark also needs to start an environment 
(the MPI environment using mpirun), which can affect the 
mean run-time. 

8 Conclusions 

We have revisited the problem of benchmarking MPI func¬ 
tions. Our work was motivated by the need (1) to fairly 
compare MPI implementations using a sound statistical 
analysis and (2) to allow a reproducibility of the experimental 
results. 

We have experimentally shown that the clock and process 
synchronization methods used to benchmark MPI functions 
have a tremendous effect on the run-time. We have also 
pointed out that the use of MPl_Barrier can potentially 
skew processes in such a way that the run-times measured 
are meaningless. To overcome the problem of synchronizing 
processes with MPl_Barrier, we have investigated the 
window-based approach, for which we require globally 
synchronized clocks. We have shown that it is essential to 
consider the clock drift between processes when seeking 
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accurate MPI timings. For this reason, the clock synchro¬ 
nization methods used in Netgauge or SKaMPI—that only 
determine clock offsets—will introduce a larger run-time 
error into run-time measurements of MPI functions unless 
the experiment is very short-lived. 

We have analyzed experimental factors of MPI experi¬ 
ments, for example, we have demonstrated that changing 
the DVFS level or the compiler flags can alter the outcome of 
the MPI benchmark. However, our most important finding 
is that a call to mpirun is a factor of the experiment, i.e., 
different calls to mpirun can produce significantly different 
means (or medians), even if all other factors and the input 
data stay unmodified. 

After investigating the implications and consequences of 
various synchronization methods and experimental factors, 
we have proposed a novel MPI benchmarking method. We 
have shown how to apply hypothesis tests such as the 
WiLCOXON TEST to increase the fairness and the evidence 
level when comparing benchmarking data. Last, we have 
demonstrated that our benchmarking method also improves 
the reproducibility of results in such a way that the measured 
performance values exhibit a much smaller variance across 
different experiments compared to other MPI benchmark 
suites. 

Acknowledgments 

We thank Maciej Drozdowski (Poznan University of Tech¬ 
nology), Peter Filzmoser (TU Wien), Friedrich Leisch and 
Bernhard Spangl (University of Natural Resources and 
Life Sciences, Vienna), Jesper Larsson Traff (TU Wien), Alf 
Gerisch (TU Darmstadt), and Thomas Geenen (SURFsara) 
for discussions and comments. Experiments presented in 
this paper were carried out using the Grid'5000 testbed, 
supported by a scientific interest group hosted by Inria and 
including CNRS, RENATER and several Universities as well 
as other organizations (see https://www.gridSOOO.fr). 

References 

[1] "Intel(R) MPI Benchmarks," http://software.intel.com/en- 
us / articles / intel-mpi-benchmarks. 

[2] D. A. Grove and P. D. Coddington, "Communication benchmarking 
and performance modelling of MPI programs on cluster computers," 
The Journal of Supercomputing, vol. 34, no. 2, pp. 201-217, 2005. 

[3] A. L. Lastovetsky, V. Rychkov, and M. O'Flynn, "MPIBlib: Bench¬ 
marking MPI communications for parallel computing on homo¬ 
geneous and heterogeneous clusters," in EuroPVM/MPI, 2008, pp. 
227-238. 

[4] J. L. Traff, "mpicroscope: Towards an MPI benchmark tool for 
performance guideline verification," in EuroMPI, 2012, pp. 100-109. 

[5] W. Gropp and E. L. Lusk, "Reproducible measurements of MPI 
performance characteristics," in EuroPVM/MPI, 1999, pp. 11-18. 

[6] T. Hoefler, A. Lumsdaine, and W. Rehm, "Implementation and 
performance analysis of non-blocking collective operations for MPI," 
in Proceedings of the 2007 ACM/IEEE Conference on Supercomputing 
(SC). ACM, 2007, pp. 1-10. 

[7] "OSU MPI benchmarks," http://mvapich.cse.ohio- 
state.edu/benchmarks/. 

[8] "Phloem MPI Benchmarks," https://asc.llnl.gov/sequoia/ 
benchmarks/. 

[9] R. Reussner, P. Sanders, and J. L. Traff, "SKaMPI: a comprehensive 
benchmark for public benchmarking of MPI," Scientific Program¬ 
ming, vol. 10, no. 1, pp. 55-65, 2002. 


[10] D. Grove, "Performance modelling of message-passing parallel 
programs," Ph.D. dissertation. University of Adelaide, 2003. 
[Online]. Available: http://hdl.handle.net/2440/37915 

[11] T. Hoefler, T. Schneider, and A. Lumsdaine, "Accurately measuring 
collective operations at massive scale," in Proceedings of the 22nd 
IEEE International Parallel & Distributed Processing Symposium, 
PMEO'08 Workshop, 04 2008. 

[12] -, "Accurately measuring overhead, communication time and 

progression of blocking and nonblocking collective operations 
at massive scale," International Journal of Parallel, Emergent and 
Distributed Systems, vol. 25, no. 4, pp. 241-258, 2010. 

[13] A. D. Kshemkalyani and M. Singhal, Distributed Computing: Prin¬ 
ciples, Algorithms, and Systems, 1st ed. New York, NY, USA: 
Cambridge University Press, 2008. 

[14] T. Worsch, R. Reussner, and W. Augustin, "On benchmarking 
collective MPI operations," in EuroPVM/MPI, ser. LNCS, no. 2474, 
2002, p. 271-279. 

[15] AMD, "AMD64 Architecture Programmer's Manual Volume 
2: System Programming, rev. 3.23," http://amd-dev.wpengine. 
netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21. 
pdf, 2015, accessed: 17/02/2015. 

[16] Intel, "Intel 64 and IA-32 Architectures Software Devel- 
oper?s Manual, Volume 2B: Instruction Set Reference, 
N-Z, Order Number: 253667-053US," http: / / www.intel. 
com / content / w ww / us / en/ architecture- and- technology / 

64-ia- 32- architectures- software- developer-vol- 2b-manual.html, 
2015, accessed: 17/02/2015. 

[17] R. Hogg and E. Tanis, Probability and Statistical Inference. Prentice 
Hall, 2006. 

[18] F. Cristian, "Probabilistic clock synchronization," Distributed Com¬ 
puting, vol. 3, no. 3, pp. 146-158,1989. 

[19] T. Jones and G. A. Koenig, "Clock synchronization in high-end 
computing environments: a strategy for minimizing clock variance 
at runtime," Concurrency and Computation: Practice and Experience, 
vol. 25, no. 6, pp. 881-897, Jul. 2012. 

[20] G. Taubenfeld, Synchronization Algorithms and Concurrent Program¬ 
ming. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2006. 

[21] J.-Y. Le Boudec, Performance Evaluation of Computer and Communica¬ 
tion Systems, ser. Computer and communication sciences. EFPL 
Press, 2010. 

[22] D. Lilja, Measuring Computer Performance. Cambridge University 
Press, 2005. 

[23] S. Ross, Introductory Statistics. Elsevier Science, 2010. 

[24] T. Chen, Y. Chen, Q. Guo, O. Temam, Y. Wu, and W. Hu, "Statistical 
performance comparisons of computers," in HPCA. IEEE, 2012, 
pp. 399M10. 

[25] T. Hoefler and R. Belli, "Scientific benchmarking of parallel 
computing systems: twelve ways to tell the masses when 
reporting performance results," in Proceedings of the International 
Conference for High Performance Computing, Networking, Storage 
and Analysis, SC 2015, 2015, pp. 73:1-73:12. [Online]. Available: 
http://doi.acm.org/10.1145/2807591.2807644 

[26] D. J. Sheskin, Handbook of Parametric and Nonparametric Statistical 
Procedures, 4th ed. Chapman & Hall/CRC, 2007. 

[27] D. C. Montgomery, Design and Analysis of Experiments. John Wiley 
& Sons, 2006. 

[28] M. Hollander and D. Wolfe, Nonparametric Statistical Methods, ser. 
Wiley Series in Probability and Statistics. Wiley, 1999. 

[29] J. Vitek and T. Kalibera, "R3: Repeatability, reproducibility and 
rigor," SIGPLAN Not., vol. 47, no. 4a, pp. 30-36, Mar. 2012. [Online]. 
Available: http://doi.acm.org/10.1145/2442776.2442781 

[30] A. Georges, D. Buytaert, and L. Eeckhout, "Statistically rigorous 
Java performance evaluation," in OOPSLA, 2007, pp. 57-76. 

[31] T. Mytkowicz, A. Diwan, M. Hauswirth, and P. F. Sweeney, 
"Producing wrong data without doing anything obviously wrong!" 
in ASPLOS, 2009, pp. 265-276. 

[32] G. Box, J. Hunter, and W. Hunter, Statistics for Experimenters: Design, 
Innovation, and Discovery. Wiley-Interscience, 2005. 

[33] S. A. A. Touati, J. Worms, and S. Briais, "The Speedup-Test: 
A Statistical Methodology for Program Speedup Analysis and 
Computation," Concurrency and Computation: Practice and Experience, 
vol. 25, no. 10, pp. 1410-1426, 2013. 

[34] J. Gil, K. Lenz, and Y. Shimron, "A microbenchmark case study and 
lessons learned," in SPLASH Workshops, 2011, pp. 297-308. 



25 


Appendix A 
List of Variables 

variable data type description 


P 

msize 

nrep 

n 

win_size 

startjtime 

func 

jmsize 

jfunc 

j^exp 

root 

my_rank 

t 

s_time, ejime 

IP 

P 

j^s_time je_time 

myoffset 

diff 

iMff 

rtt 

ytt 

N_FITPTS 

N_EXCHANGES 

N_PINGPONGS 

Im 

jmodel 

tremote 

tlocal 

tglobal 

r, server, client, p_ref 


uint 

uint 

uint 

uint 

double 

double 

address 

list<uint> 

list<address> 

list<exp> 

uint 

uint 

double 

double 

list<uint> 

array<double> 

array<double> 

double 

double 

list<double> 

double 

list<double> 

uint 

uint 

uint 

tuple (double, dou¬ 
ble) 

list<address> 

uint 

uint 

uint 

uint 


number of processes 
message size 

number of repetitions (in one mpirun) 
number of calls to mpirun 
window size 

start time in window-based synchronization 

MPl function 

list of message sizes 

list of MPl calls 

list of experiments {exp is a 3-tuple {msize,func, nrep)) 

rank of root process 

rank of current process 

run-time 

timestamps 

list of processes 

list of run-times 

list of timestamps 

clock offset of the current process 

clock offset 

list of clock offsets 

round trip time (rtf) 

list of rtfs 

number of points to fit linear model (HCA, JK) 
number of ping-pong messages exchanged to record 
one fitpoint (HCA, JK) 

number of ping-pong messages for rtt estimation 
linear model of the clock drift defined by a tuple {slope, 
intercept) 

list of linear models 

current time on remote process 

local time of current process 

global (normalized) time of current process 

process ranks 
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Appendix B 

Pseudo-Codes of Clock Synchronization 
Methods 

B.1 The SKaMPI Benchmark 


Algorithm 7 Clock offset between two processes. 

1: function SKAMPI_PlNGPONG{pl, p2) 

2; td_min — —oo 

3: td_max — oo 

4: N_PINGPONGS = 100 

5: for i in 0 to N_PINGP0NGS — 1 do 

6: if my_rank —— pi then 

7: sjast ^ Get_Time{) 

8: MPI_Send(s Jast, 1, mpi_double, p2) 

9: MPI_RECV(U^sf, 1, MPI_DOUBLE, p2) 

10; s_now — Get_Time() 

11: tdjnin — MAX(td_min, t_last — sjiow) 

12; tdjnax — MlN(td_max, tjast — s_last) 

13; else if my_rank == p2 then 

14; MPI_RECV(sJast, 1, MPI_DOUBLE, pi) 

15: tjast ^ Get_Time() 

16; MPI_SEND(MflSt, 1, MPI_DOUBLE, pi) 

17: t_now — Get_Time() 

18; tdjnin — MAX{tdjnin, sjast — tjiozv) 

19: tdjnax — MlN(tdjnax, sjast — tjast) 

20; diff — (tdjnin + tdjnax) {2 

21; return diff 


Algorithm 9 Timing procedure (SKaMPI and Netgauge) 

1; procedure MEASURE(/Totc, msize) 

2: sjime = Get_Time() 

3: func(msize) 

4; ejime = Get_Time() 

5: t — ejime — sjime 

6: return t 


Algorithm 8 Clock offsets measurement relative to the root 
and synchronization window initialization. 

- list of clock offsets of the current process relative to each of 
the others 

start Jime - timestainp of the first synchronization window 
1: procedure Compute_And_Set_Clock_Offsets 
2: for i in 0 to p — 1 do 

3: = 0 

4; for i in 0 to p — 1 do 
5; MPI_Barrier() 

6: if myj-ank —— root then 

7: = SKAMPl_PlNGPONG(roof, i) 

8: else if myj-ank —— i then 

9: f'fflroot] = SKAMPI_PlNGPONG(roof, i) 

10; if myjank —= root then 
11: tmp = 

12; MPI_BCAST(fmp, p, MPl_DOUBLE, root) 

13; for i in 1 to p — 1 do 

14: — tmp[i] + l‘^'^[root] 

15: procedure Initialize_First_Sync_Window 
16; if myjank —— root then 

17; start Jime — Get_Time() 

18: MFl_BCAST{startJime, 1, MPl_DOUBLE, root) 

19; function Start_SYNC( win_size, counter) 

20: sync_error — 0 

21: nextjvin — start Jime — l‘^'^[root] + (counter + 1) • win_size 

22: time ^ Get_Time() 

23; if time > nextjvin then 

24: sync_error — STARTED_LATE 

25: while time < nextjvin do 

26; time ^ Get_Time() 

27; return sync_error 

28; function STOP_SYNC(zvin_size) 

29: sync_error — 0 

30: time ^ Get_Time() 

31: if time — nextjvin > zvin_size then 

32; sync_error = TOOK_TOO_LONG 

33; return sync_error 


Algorithm 10 The SKaMPI benchmark. 

1; procedure Benchmark(/-^”‘^, ynsize^ maxjep, minjep) 

I 1- MPI functions to benchmark 

H I^msize _ ^jr 5/2eS 

// max_rep/min_rep - max/min number of measurements for each message size 
max_std_err - max standard error of measurements 

2; Compute_And_Set_Clock_Offsets() 

3: Initialize_First_Sync_Window() 

4: for msize in & func in if”' do 

5: zvin_size — 0 

6: while stop ^ TRUE do 

7: set nrep 

8: for 2 in 0 to nrep — 1 do 

9: jerror.toij^j _ START_SYNC(a)m_S!Ze, i) 

10: [i] = MEASURE(/Mnc, msize) 

11: ^ [j]+STOP_SYNC(mi)J_size) 

12; compute total Jime 

13: MPl_GATHER(i nrep, MPI_DOUBLE, 

l‘, nrep, MPI_DOUBLE, root) 

14: MPl_ALLREDUCE(i‘'''''"-°“', nrep, MPI_DOUBLE, 

I""', nrep, MPI_DOUBLE, MPI_MAX) 

15: if my_rank —— root then 

16: /'J’”'.add({ l'[i] M i < nrep s.t. i™[i] == 0}) 

17: if n_errors > nrep jA then 

18: win_size —MAX(2 ■ win_size, total_time/ {nrep + 1) ■ 1.5) 

19: max_consec_errors — MAX{(j — i) s.t. 

r^'lk] > 0. i < fc < i ) 

20: if max_consec_errors > nrep/2 then 

21: nrep = MAX{nrep/2, 4) 

22: std_error = COMPUTE_Std_Error(/'-^”') 

23: stop — (LEN(i^-^”‘’b > max_rep) or 

(LEN(i^-^”‘’b > min_rep & std_error < max_std_error) 
24: MPl_BCAST(H;m_size, 1, MFI_DOUBLE, root) 

25: MPl_BCAST(nrep, 1, MPI_DOUBLE, root) 

26: MPl_BCAST(sfop, 1, MPI_DOUBLE, root) 

27: 1nitialize_First_Sync_W indow() 

28: print 
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B.2 Netgauge/NBCBench Synchronization 


Algorithm 11 Clock offsets measurement relative to the root 
and synchronization window initialization. 

r - current process rank 
p - number of processes 

maxpower — 

. list of clock offsets of the current process relative to each of 
the others 

myoffset - clock offset of the current process relative to root 
start Jime - next window start time, updated after each sync 


1 : 

2 : 

3: 

4: 

5: 

6 : 

7: 


9: 

10 : 

11 : 

12 : 

13: 

14: 

15: 

16: 


procedure Sync_Clocks_POw2 

// compute clock offsets of processes with ranks between 0 and maxpower — 1 
round — 1 

if r > maxpower then return 
while < maxpower do 

if r mod 2'’‘^“”‘^ == 0 then H client 
server - r + 

[server] — COMPUTE_OFFSET{r, server) 

I I receive time differences collected by the server 

MPl_RECv{recvdiffs, 2''*^""'^“^ - 1, mpi_double, server) 

// compute final time differences 

for i in 0 to ( 2 '''’“”'^“^ - 2) do 

[server + i + 1] = [server] + recvdijfs[i] 
if r mod = 2''‘^und-i then//server 

client - r - 2™"'^''"! 
diff — COMPUTE_OFFSET(c/ienf, r) 

!I send all the time differences to the client 

MPI_SEND(/'''^[r + 1], 2''^“''^'“^ - 1, MPI_DOUBLE, client) 
round — round + 1 

H send final time differences from root to all processes 
MPI_SCATTER(Z'^'^, 1, MPI_DOUBLE, 

myoffset, 1, mpi_double, root) 


17: procedure Sync_Clocks_Remaining 

// compute clock offsets of processes with ranks between maxpower and p — 1 
18: if maxpower == p then return 

19: if r < p — maxpower then 

20: server = r + maxpower 

21: diff — COMPUTE_OFFSET(r, server) 

22: diff — diff + myoffset 

23: MPI_SEND(dij^, 1, MPl_DOUBLE, server) 

24: else if r > maxpower then 

25: client — r — maxpower 

26: diff — COMPUTE_OFFSET(d/cnf, r) 

27: MPI_RECV(mi/q^scf, 1, MPl_DOUBLE, client) 

28: procedure Initialize_First_Sync_Window 
29: N ^ 10 

30: local_bcasttime = run-time of N executions of MPI_Bcast 

31: MFl_REDlJCE{local_bcasttime, bcasttime, 1, mpi_double, root) 

32: start_time — Get_Time() -1- bcasttime 

33: MRl_BCAST{start_time, 1, mpi_double, root) 

34: start_time — start_time — myoffset I I adjust next start time to local clock 


Algorithm 12 Clock offset between two processes. 

1: procedure COMPUTE_OFFSET(dzcMf, server) 

2: N_PINGPONGS = 100 

3: rtt ^ 0 

4: while rtt < min(last n_pinGPONGS) do 

5: if my_rank =— client then 

6: s_time — Get_Time() 

7: MPI_SEND(s_h‘me, 1, mpi_double, server) 

8: MPI_RECv(fremofe, 1, mpi_double, server) 

9: e_time — Get_Time() 

10: rtt = e_time — s_time 

11: diff = s_time rtt/2 — tremote 

12: if my_rank —— server then 

13: MPI_RECV(s_hme, 1, MPl_DOUBLE, client) 

14: tremote — Get_Time() 

15: MPI_SEND(fremofc, 1, mpi_double, client) 

16: return diff 


Algorithm 13 Synchronization function. 

1: procedure SYNC(win_size) 

2: time = Get_Time() 

3: if time > start_time then // sync started too late 

4: err — time — start_time 

5: else 

6: while time < start_time do 

7: time — Get_Time{) 

8: start_time — start_time win_size 

9: return err 


Algorithm 14 NBCBench measurement procedure. 

1: procedure BENCHMARK(/'"®'^^,/wnc, nrep) 
jj i^msize _ hgt of message sizes 
Hfunc - MPI function to benchmark 
// nrep - number of measurements for each message size 
2: SYNC_ClOCKS_POW2() 

3: SYNC_ClOCKS_REMAINING{) 

4: for msize in do 

5: for i in 0 to warmup_ROUNDS do 

6: MEASURE(/Mnc, msize) 

7: for i in 0 to {nrep j 10 — 1) do 

8: runtimes[i] = MEASURE(/wnc, msize) 

9: local _est_r untime — }sAlN{runtimes) 

10: MPI_ALLREDUCE(/ocfl/_csf_rwMhme, 1, mpi_double, 

estimated_runtime, 1, mpi_double, mpi_max) 
11: if estimated_runtime ■ 5 • nrep > 10 seconds then 

12: nrep — MAX(nrep/2, 4) 

13: MPI_BCAST{nrep, 1, MPl_DOUBLE, root) 

// main measurement loop 

14: Initialize_First_Sync_Window{) 

15: for 2 in 0 to nrep — 1 do 

16: ^ SYNC(lOT'«_size) 

17: = MEASVRE(func, msize) 

18: MPl_ALLREDUCE(f™''-“, nrep, MPI_DOUBLE, 

1"'°', nrep, MPI_DOUBLE, MPI_MAX) 

19: yt<nrep s.t. = 0} 

20: if n_errors > 0.25 ■ nrep or len(/^-‘'‘^‘’^) < 4 then 

21: win_size — win_size ■ 1.5 

22: REPEAT measurement for current msize 

23: MP1_Gather(; nrep, MPI_DOUBLE, 

l‘, nrep, MPI_DOUBLE, root) 

24: print l‘ 
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B.3 Clock Synchronization Algorithm of Jones & 

Koenig (JK) 


Algorithm 15 Linear Model of the clock drift. 

p - number of processes 
r - current process rank (0 to p — 1) 

Im - linear model of the current process (defined by a slope and an 
intercept) to adjust the local clock to the reference time of root 
start Jime - next window start time, updated after each synchronization 

1: function Learn_Model(n_fitpts, n_exchanges, rtt, root) 

2: slope — 0, intercept — 0 

3: if my_rank —— root then 

4: for idx in 0 to n_eitpts — 1 do 

5: for r in 0 to (p — 1) & r ^ root do 

6: for z in 0 to N_EXCHANGES — 1 do 

7: MPI_RECV(fdMmmy, 1, MPl_DOUBLE, r) 

8: tremote — Get_Time() 

9: MPI_SEND(frcmofe, 1, mpi_double, r) 

10; if my_rank ^ root then 

11: for idx in 0 to n_pitpts — 1 do 

12; for z in 0 to N_EXCHANGES — 1 do 

13; MPI_SEND{fdMmmy, 1, mpi_double, root) 

14; MPI_RECV{fremofc, 1, MPl_DOUBLE, root) 

15: local_times[i] — Get_Time() 

16; — local_times[i] — tremote — rtt/2 

17: 1^‘ff - SORT(/‘''^) 

18; yfit[idx] ^ Compute_Median(/‘^'^) 

19; idxjnedian — i s.t. 0 < z < n_exchanges Sz 

== yfitlidx] 

20: xfit[idx] — local_times[median_idx\ 

21: (slope^ intercept) — LlNEAR_FlT(x/ir, yfit, N_FITPTS) 

22: return NEW_LM(s/ope, intercept) 


23 

24 

25 

26 

27 

28 
29 


procedure Sync_Clocks 
WARMUP_ROUNDS() 
for f in 1 to p — 1 do 

l'“[i] = COMPUTE_RTT(raof, i) 

MPI_SCATTER(r“, 1, MPI_DOUBLE, rtt, 1, MPI_DOUBLE, root) 

Im =Learn_Model(n_fitpts, n_exchanges, rtt, root) 
MPI_Barrier() 


30: startjime = Get_Normalized_Time(Get_Time()) + win_size 

31: MPI_BCAST(stflrt_fime, 1, MPI_DOUBLE) 


Algorithm 16 Local time normalization to reference clock (JK 
and HCA). 

1: function GET_NORMALIZED_TlME(/ocfli_timc) 

2: return localjime — (local_time ■ Im.slope + Im.intercept) 


Algorithm 17 Measurement of the RTT between two nodes 
(JK and HCA). 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 


function COMPUTE_RTT(pl, p2) 
mean_rtt — 0 

Warmup_ROUNDS() !I send dummy ping-pong messages 
if my_rank —— pi then 

for z in 0 to n_pingpongs — 1 do 

MPI_RECv(fdwmmy, 1, mpi_double, p2) 
tremote — Get_Time() 

MPI_SEND(fremofc, 1, mpi_double, p2) 
else if my_rank —— p2 then 

for z in 0 to n_pingpongs — 1 do 
s_time — Get_Time() 

MPI_SEND(s_fime, 1, mpi_double, pi) 
MPl_RECv{tremote, 1, mpi_double, pi) 
e_time — Get_Time() 

[z] = e_time — s_time 
— Remove_Outliers(/''") 
mean_rtt — Mean(Z'’“) 
return mean_rtt 
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Appendix C 

Overview of MPI Experiments 

This section details the experimental setup and the pseudo¬ 
code of each of the experiments presented in this paper. 
Unless otherwise stated, the following configuration is 
common to all the obtained results. 


Parameter 

Values 

parallel machine 

TUWien 

compiler 

gcc 4.4.7 

compiler flags 

-02 

DVFS 

CPU frequency fixed to the highest available frequency 
level 

cache 

No cache control 

process pinning 

Core pinning {-bind-to rr / -pin_mode 
consec-rev) 

timing mechanism 

RDTSCP 


C.1 Experiment: Clock Drift 


Parameter 

Values Details 

Experiment configuration - Figure 1 

V 

nrep 

WARMUP_ROUNDS 

16 X 1 Number of nodes and processes per node 

100 Number of measured ping-pongs 

10 Number of warmup rounds performed be¬ 

fore measurement 


Algorithm 18 Experiment: Clock drift. 

1; Warmup_ROUNDS() H dummy ping-pong messages between root and r 

2: if my_rank == root then 

3; for r in 0 to p — 1 do 

4; if r root then 

5: for rep in 0 to nrep — 1 do 

6: [rep] = Get_Time() 

7: MPI_SEND(/“-'‘”'-'““'[rep], 1, MPI_DOUBLE, r) 

8: MPI_RECV(/"”“"[''ep], 1, MPI_DOUBLE, r) 

9: [rep] = Get_Time() 

10: NanOSLEEP(0.5 s) 

11: for rep in 0 to nrep — 1 do 

12: print r, _ p™ofc j^^pj 

13: else if my_rank root then 
14: for rep in 0 to nrep — 1 do 

15: MPI_RECV(tdwmmy, 1, MPl_DOUBLE, root) 

16: local_time — Get_Time() 

17: MPI_SEND(/oca/_fime, 1, mpi_double, root) 


C.2 Experiment: Frequency Caiibration 


Parameter Values Details 


Experiment configuration - Figure 2 

p 16x1 Number of nodes and processes per node 

nrep 100 Number of repetitions 

HRT_CALIBRATE - Frequency estimation function from Net- 

gauge 2.4.6 


C.3 Experiment: Ciock Offset after Synchronization 


Parameter 

Values 

Details 

Experiment configuration - Figure 3 


V 

16 X 1 

Number of processes 

n 

10 

Number of experiments 

nrep 

10 

Number of PingPong opera¬ 
tions performed in each step to 
measure the clock offset 

sleep_time 

Is 

Waiting time between clock off¬ 
set measurements 

nsteps 

10 

Number of clock offset measure¬ 
ments 

MPI 

MVAPICH 2.1a 

MPI implementation 


Algorithm 20 Experiment: Clock offset after sync. 

1: compute the list of rtts between root and process r 
2: INIT_SyNC_MODULE() 

3: for step in 1 to nsteps do 

4: if my_rank == root then 

5: for r in 0 to p — 1 do 

6: if r ^ root then 

7: for; in 0 to mounds — 1 do 

8: MPI_SEND(s_fime, 1, MPI_D0UBLE, r) 

9: MFl_RECV(tremote, 1, MPl_DOUBLE, r) 

10: if syncjtype == HCA then 

11: tlocal = Get_AdjuSTED_Time() - l''^‘[r]/2 

12: else 

13: f/octj/^ Get_Time() -/''^'[r]/2 

14: [r] [/] — tremote 

15: SLEEP(sleep_tiine) 

16: else if my_rank ^ root then 

17: MPI_RECV(tdwmmy, 1, MPl_DOUBLE, 0) 

18: if sync_type -- HCA then 

19: tlocal — Get_Adjusted_Time{) 

20: else 

21: tlocal ^ Get_Time() 

22: tglobal — GET_NORMALIZED_TlME(f/ 0 Cfl/) 

23: MPI_SEND(t^/o^7a/, 1, MPI_DOUBLE, 0) 

24: if my_rank == root then 
25: for r in 0 to p — 1 do 

26: for / in 0 to mounds — 1 do 

27: print r.f't[;•],;»“"'[,■] 


C.4 Experiment: Run-time Drift in Netgauge and 
SKaMPI 


This experiment is based on Algorithm 5, in which the 
synchronization method has been set to either Netgauge, 
SKaMPI or MPI_Barrier. 


Parameter 

Values 

Details 

Experiment configuration - Figure 4 

V 

32 X 16 

Number of processes 

n 

1 

Number of experiments 

nrep 

4000 

Number of measurements 

func 

MP I_Bcast 

Benchmarked function 

msize 

8192 Bytes 

message size 

win_size 

300 ps 

Window size 

MPI 

MVAPICH 2.1a 

MPI implementation 


Algorithm 19 Experiment: Erequency calibration. 

1: for i in 0 to nrep — 1 do 
2: HRT_CALIBRATE(/re<?) 

3: =freq 

4: MPI_GATHER(jf'‘'’, nrep, mpi_uint64_t, 

alljreqs, nrep, mpi_uint64_t, root) 

5: if my_rank == root then 
6: for r in 0 to p — 1 do 

7: for rep in 0 to nrep — 1 do 

8: print r, rep, alljreqs[r ■ nrep -\- rep] 








30 


C.5 Experiment: Clock Offset after Synchronization 


This experiment relies on Algorithm 20, where only one 
measurement round is performed after synchronization, 
instead of multiple nsteps. 


Parameter 

Values 

Details 

Algorithm parameters 

SLEEP_TIME 

Os 

No waiting time between synchroniza¬ 
tion and measurements 

nsteps 

1 

Number of clock offset estimations 

mounds 

10 

Number of PingPong operations per¬ 
formed in each step to measure clock 
offset 

N_EITPTS 

1000 

Number of fitpoints used to fit linear 

(HCA/JK sync.) 


models 

N EXCHANGES 

100 

Number of ping-pong messages ex¬ 

(HCA/JK sync.) 


changed to obtain the difference be¬ 
tween local and reference times corre¬ 



sponding to a single fit point 

MPI 

MVAPICH 2.1a 

MPI implementation 

Experiment configuration - Figure 6(a) 

V 

[2-36] xl 

Number of processes 

n 

10 

Number of experiments 

Experiment configuration - Figure 6(b) 

V 

[2-36] X 16 

Number of processes 

n 

10 

Number of experiments 


C.6 Experiment: Comparison of Synchronization Meth¬ 
ods w.r.t. the Clock Drift 


This experiment is based on Algorithm 20, in which the 
synchronization method has been set to either TTCA, JK, 
SKaMPl or Netgauge. 


Parameter 

Values 

Details 

Algorithm parameters 

SLEEP_TIME 

Is 

Waiting time between clock offset mea¬ 
surements 

nsteps 

20 

Number of clock offset estimations 

mounds 

10 

Number of PingPong operations per¬ 
formed in each step to measure clock 
offset 

N EITPTS 

1000 

Number of fitpoints used to fit linear 

(HCA/JK sync.) 


models 

N_EXCHANGES 

100 

Number of ping-pong messages ex¬ 

(HCA/JK sync.) 


changed to obtain the difference be¬ 
tween local and reference times corre¬ 



sponding to a single fit point 

Experiment configuration - Figure 7 

V 

16 X 16 

Number of processes 

n 

10 

Number of experiments 

MPI 

MVAPICH 2.1a 

MPI implementation 

Experiment configuration - Figure 33 

V 

16 X 1 

Number of processes 

n 

10 

Number of experiments 

MPI 

MVAPICH 2.1a 

MPI implementation 

Experiment configuration - Figure 34 

V 

15 X 8 

Number of processes 

n 

10 

Number of experiments 

MPI 

MVAPICH 1.9 

MPI implementation 

machine 

Edel (G5k) 


Experiment configuration - Figure 35 

V 

16 X 1 

Number of processes 

n 

1 

Number of experiments 

machine 

Cartesius (SURFsara) 

MPI 

Intel MPI 4.1 

MPI implementation 


C.7 Experiment: Run-time of MPi_Barrier 


Parameter Values 

Details 

Algorithm parameters 



n 

30 

Number of mpiruns the experi¬ 
ment was repeated 

nrep 

100 000 

Number of MPI_Barrier calls 

WARMUP_ROUNDS 

10 

Number of warmup rounds per¬ 
formed before measurement 


Experiment configuration - Figure 8 


p 32 X 16 Number of processes 

MPI MVAPICH 2.1a MPI implementation 

Experiment configuration - Figure 36 

p 16 X 1 Number of processes 

MPI MVAPICH 2.1a MPI implementation 

Experiment configuration - Figure 37 

p 15 X 8 Number of processes 

MPI MVAPICH 1.9 MPI implementation 

machine Edel (G5k) 


Algorithm 21 Experiment: Run-time of MPl_Barrier 

1: WARMUP_ROUNDS of MPI_Barrler 
2: sjime = Get_Time() 

3: for i in 0 to nrep — 1 do 
4; MPI_Barrier() 

5: e_time — Get_Time() 

6: localjtime — {ejime — sjime) jnrep 

7: MPI_REDUCE(/oCfl/_fime, t, 1, MPI_DOUBLE, MPI_MAX, WOt) 

8: if my_rank == root then 
9: print barrier time t 


C.8 Experiment: Comparison of Synchronization Meth¬ 
ods - Clock Offset vs. Synchronization Time 


Parameter 

Values 

Details 

Algorithm parameters (Algorithms 20 and 22) 

SLEEP_TIME 

Is 

Waiting time between clock offset mea¬ 
surements 

nsteps 

20 

Number of clock offset estimations 

mounds 

10 

Number of PingPong operations per¬ 
formed in each step to measure clock 
offset 

N_EITPTS 

10, 100, 200, 

Number of fitpoints used to fit linear 

(HCA/JK sync.) 

300, 500, 
700, and 
1000 

models 

N EXCHANGES 

[10-100] 

Number of ping-pong messages ex¬ 

(HCA/JK sync.) 


changed to obtain the difference be¬ 
tween local and reference times corre¬ 



sponding to a single fit point 

Experiment configuration - Figure 8 

V 

32 X 16 

Number of processes 

n 

10 

Number of experiments 

MPI 

MVAPICH 2.1a 

MPI implementation 

Experiment configuration - Figure 36 

V 

16 X 1 

Number of processes 

n 

10 

Number of experiments 

MPI 

MVAPICH 2.1a 

MPI implementation 

Experiment configuration - Figure 37 

V 

15 X 8 

Number of processes 

n 

10 

Number of experiments 

MPI 

MVAPICH 1.9 

MPI implementation 

machine 

Edel (G5k) 



Algorithm 22 Experiment: Sync, duration. 

1: sjime — Get_Time{) 

2: Init_Sync_Module() // compute clock drifts, linear models 
3: ejime — Get_Time{) 

4: syncjimejocal — ejime — sjime 

5: MFl_REDUCE{syncJimeJocal, syncjime, 1, mpi_double, mpi_max, root) 
6: print syncjime 
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C.9 Experiment: Impact of the Timing Mechanism - Lo- 
cai Times vs. Giobai Times 


This experiment is based on Algorithm 5, in which the 
synchronization method has been set to either TiCA or 

MPI_Barrier. 


Parameter 

Values 

Details 

Experiment configuration - Figure 9 


V 

16 X 1 

Number of processes 

n 

1 

Number of experiments 

nrep 

4000 

Number of measurements per ex¬ 
periment 

func 

MPI_A11reduce 

Benchmarked function 

msize 

32KiB 

message size 

win_size 

1 ms 

Window size 

N_FITPTS (HCA 
sync.) 

1000 

Number of fitpoints used to fit 
linear models 

N_EXCHANGES 
(HCA sync.) 

100 

Number of ping-pong messages 
exchanged to obtain the differ¬ 
ence between local and reference 
times corresponding to a single 
fit point 

MPI 

machine 

MVAPICH 2.0a-qlc 
VSC-3 

MPI implementation 


C.10 Experiment: MPi_Barrier Exit Times 

This experiment is based on Algorithm 5, in which the 
synchronization method has been set to HCA. 


Parameter 

Values 

Details 

Experiment configuration - Figure 10 


V 

16 X 1 

Number of processes 

n 

1 

Number of experiments 

nrep 

1000 

Number of measurements per ex¬ 
periment 

func 

MPI_Barrier 

Benchmarked function 

win_size 

loops 

Window size 

N_FITPTS (HCA 
sync.) 

1000 

Number of fitpoints used to fit 
linear models 

N_EXCHANGES 
(HCA sync.) 

100 

Number of ping-pong messages 
exchanged to obtain the differ¬ 
ence between local and reference 
times corresponding to a single 
fit point 

MPI 

machine 

MVAPICH 2.0a-qlc, 
Intel MPI 5 

VSC-3 

MPI implementation 


C.11 Experiment: Barrier Impiementation Impact 


This experiment is based on Algorithm 5, in which the 
synchronization method has been set to either MPl_Barrier 
or a dissemination barrier implemented into the benchmark. 


Parameter 

Values 

Details 

Experiment configuration - Figure 11 

V 

16 X 1 

Number of processes 

n 

10 

Number of experiments 

nrep 

1000 

Number of measurements per ex¬ 



periment 

func 

MPI_Bcast 

Benchmarked function 

msize 

2®-2^® Bytes 

Message size 

MPI 

MVAPICH 2.0a-qlc, 

MPI implementation 


Intel MPI 5 


machine 

VSC-3 



C.12 Experiment: The Influence of mpirun 


This experiment is based on Algorithm 5, in which the 
synchronization method has been set to MPl_Barrier. 


Parameter 

Values 

Details 

Algorithm parameters 

n 

30 

Number of experiments 

nrep 

1000 

Number of measurements per exper¬ 
iment 

Experiment configuration - Figure 14 (a) 

V 

16 X 1 

Number of processes 

msize 

SKiB 

Message size 

func 

MP I_Bcast 

Benchmarked function 

MPI 

NEC MPI 1.2.8 

MPI implementation 


Experiment configuration - Figure 14 (b) 


V 

32 X 1 

Number of processes 

msize 

2KiB 

Message size 

func 

MPI_Bcast 

Benchmarked function 

MPI 

Intel MPI 4.1 

MPI implementation 

machine 

VSC-1 



Experiment configuration - Figure 14 (c) 


V 

16 X 1 

Number of processes 

msize 

SKiB 

Message size 

func 

MPI_Bcast 

Benchmarked function 

MPI 

MVAPICH 1.9 

MPI implementation 

machine 

Edel (G5k) 



c.1 3 Experiment: Distribution of Run-times with 
Window-based Synchronization 


This experiment is based on Algorithm 5, in which the 
synchronization method has been set to JK. 


Parameter 

Values 

Details 

Experiment configuration - Figure 15 


V 

16 X 1 

Number of processes 

n 

500 

Number of experiments 

nrep 

1000 

Number of measurements per exper¬ 
iment 

msize 

1000 Bytes 

Message size 

func 

MPI_A11reduce 

Benchmarked function 

win_size 

1 ms 

Window size 

N_FITPTS 

1000 

Number of fitpoints used to fit linear 
models 

N_EXCHANGES 

20 

Number of ping-pong messages ex¬ 
changed to obtain the difference be¬ 
tween local and reference times cor¬ 
responding to a single fit point 

MPI 

MVAPICH 2.1a 

MPI implementation 
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C.14 Experiment: Run-time Histograms 


This experiment is based on Algorithm 5, in which the 
synchronization method has been set to either MPl_Barrier 
or HCA. 


Parameter 

Values 

Details 

Experiment configuration - Figure 12, Figure 16 

V 

16 X 1 

Number of processes 

n 

10 

Number of experiments 

nrep 

1000 

Number of measurements per ex¬ 
periment 

msize 

10 000 Bytes, 

1000 Bytes 

Message size 

func 

MPI_Scan, 

MPI_A11reduce 

Benchmarked function 

MPI 

NEC MPI 1.2.8 

MPI implementation 

Experiment configuration - Figure 41 


V 

16 X 8 

Number of processes 

n 

20 

Number of experiments 

nrep 

4000 

Number of measurements per ex¬ 
periment 

msize 

1 KiB, 8 KiB 

Message size 

func 

MPI_Bcast, 
MPI_A11reduce 

Benchmarked function 

win_size 

500 jis 

Window size 

N_FITPTS (HCA 

1000 

Number of fitpoints used to fit 

sync.) 


linear models 

N_EXCHANGES 

100 

Number of ping-pong messages 

(HCA sync.) 


exchanged to obtain the differ¬ 
ence between local and reference 
times corresponding to a single 
fit point 

MPI 

MVAPICH 1.9 

MPI implementation 

machine 

Edel (G5k) 


Experiment configuration - Figure 42 


V 

16 X 1 

Number of processes 

n 

3 

Number of experiments 

nrep 

4000 

Number of measurements per ex¬ 
periment 

msize 

Bytes 

Message size 

func 

MPI_Bcast, 
MPI_A11reduce, 
MPI_Scan 

Benchmarked function 

win_size 

500 jis 

Window size 

N_FITPTS (HCA 

1000 

Number of fitpoints used to fit 

sync.) 


linear models 

N_EXCHANGES 

100 

Number of ping-pong messages 

(HCA sync.) 


exchanged to obtain the differ¬ 
ence between local and reference 
times corresponding to a single 
fit point 

MPI 

Intel MPI 4.1 

MPI implementation 

machine 

Cartesius (SURFsara) 


Experiment configuration - Figure 43 


V 

16 X 1 

Number of processes 

n 

10 

Number of experiments 

nrep 

1000 

Number of measurements per ex¬ 
periment 

msize 

2®-2^® Bytes 

Message size 

func 

MPI_Bcast, 
MPI_A11reduce 

Benchmarked function 

win_size 

1 ms 

Window size 

N_FITPTS (HCA 

1000 

Number of fitpoints used to fit 

sync.) 


linear models 

N_EXCHANGES 

100 

Number of ping-pong messages 

(HCA sync.) 


exchanged to obtain the differ¬ 
ence between local and reference 
times corresponding to a single 
fit point 

MPI 

MVAPICH 2.0a-qlc 

MPI implementation 

machine 

VSC-3 



C.15 Experiment: Run-time Drift-JK vs. MPi_Barrier 
Synchronization 

This experiment is based on Algorithm 5, in which the 
synchronization method has been set to either JK or 

MPI_Barrier. 


Parameter 

Values 

Details 

Experiment configuration - Figure 17 


V 

16 X 1 

Number of processes 

n 

1 

Number of experiments 

nrep 

500 000 

Number of measurements per exper¬ 
iment 

msize 

1000 Bytes 

Message size 

func 

MPI_A11reduce 

Benchmarked function 

win_size 

1 ms 

Window size 

N_FITPTS (JK) 

1000 

Number of fitpoints used to fit linear 
models 

N EXCHANGES 

QK) 

20 

Number of ping-pong messages ex¬ 
changed to obtain the difference be¬ 
tween local and reference times cor¬ 
responding to a single fit point 

MPI 

MVAPICH 2.1a 

MPI implementation 


C.16 Experiment: Impact of Window Size on Run-time 
and Number of Invaiid Resuits 

This experiment is based on Algorithm 5, which has been 
repeated for TTCA, SKaMPl and Netgauge. 


Parameter 

Values 

Details 

Experiment configuration - Figure 19, Figure 20 

V 

16 X 1 

Number of processes 

n 

10 

Number of experiments 

nrep 

1000 

Number of measurements per exper¬ 
iment 

msize 

8 KiB 

Message size 

func 

MPI_Alltoall, 
MPI_Scan 

Benchmarked function 

win_size 

[150-10 000] ps 

Window size 

N FITPTS 

1000 

Number of fitpoints used to fit linear 

(HCA) 


models 

N EXCHANGES 

100 

Number of ping-pong messages ex¬ 

(HCA) 


changed to obtain the difference be¬ 
tween local and reference times cor¬ 
responding to a single fit point 

MPI 

MVAPICH 2.1a 

MPI implementation 



33 


C.17 Experiment: Run-time Drift Comparison 


This experiment is based on Algorithm 5, which has been 
repeated for each synchronization method. 


Parameter 

Values 

Details 

Algorithm parameters 

nrep 

4000 

Number of measurements per exper¬ 
iment 

win_size 

500 ps 

Window size 

N FITPTS (JK, 

1000 

Number of fitpoints used to fit linear 

HCA) 


models 

N_EXCHANGES 

100 

Number of ping-pong messages ex¬ 

QK, HCA) 


changed to obtain the difference be¬ 
tween local and reference times cor¬ 



responding to a single fit point 

Experiment configuration - Figure 18 

V 

32 X 16 

Number of processes 

n 

10 

Number of experiments 

msize 

SKiB 

Message size 

func 

MPI_A11reduce 

Benchmarked function 

MPI 

MVAPICH 2.1a 

MPI implementation 

Experiment configuration - Figure 38 

V 

15 X 8 

Number of processes 

n 

10 

Number of experiments 

msize 

32KiB 

Message size 

func 

MPI_Bcast 

Benchmarked function 

MPI 

MVAPICH 1.9 

MPI implementation 

machine 

Edel (G5k) 


Experiment configuration - Figure 39 

V 

16 X 1 

Number of processes 

msize 

SKiB 

Message size 

n 

3 

Number of experiments 

func 

MPI_Scan 

Benchmarked function 

MPI 

Intel MPI 4.1 

MPI implementation 

machine 

Cartesius (SURFsara) 

Experiment configuration - Figure 40 

V 

16 X 1 

Number of processes 

msize 

IKiB 

Message size 

n 

10 

Number of experiments 

func 

MPI_A11reduce 

Benchmarked function 

MPI 

MVAPICH 2.0a- 
qlc 

MPI implementation 

machine 

V$C-3 



C.18 Experiment: Pinning Effect 

This experiment is based on Algorithm 5 and was conducted 
using the TTCA synchronization. 

Parameter 

Values 

Details 

Experiment configuration - Figure 21 


V 

16 X 16 

Number of processes 

n 

10 

Number of experiments 

nrep 

1000 

Number of measurements per exper¬ 
iment 

msize 

[1000- 
10 000] Bytes 

Message size 

func 

MPI_A11reduce 

Benchmarked function 

win_size 

1 ms 

Window size 

N_FITPTS 

1000 

Number of fitpoints used to fit linear 
models 

N_EXCHANGES 

100 

Number of ping-pong messages ex¬ 
changed to obtain the difference be¬ 
tween local and reference times cor¬ 
responding to a single fit point 

MPI 

NEC MPI 1.2.11 

MPI implementation 

clock 

MPI_Wtime 

clock source for time measurements 


C.19 Experiment: Compiier Effect 

This experiment is based on Algorithm 5 and was conducted 
using the TTCA synchronization. 


Parameter 

Values 

Details 

Experiment configuration - Figure 22 


V 

16 X 1 

Number of processes 

n 

30 

Number of experiments 

nrep 

1000 

Number of measurements per exper¬ 
iment 

msize 

2®-2^^ Bytes 

Message size 

func 

MPI_A11reduce 

Benchmarked function 

win_size 

1 ms 

Window size 

N_FITPTS 

1000 

Number of fitpoints used to fit linear 
models 

N_EXCHANGES 

100 

Number of ping-pong messages ex¬ 
changed to obtain the difference be¬ 
tween local and reference times cor¬ 
responding to a single fit point 

MPI 

MVAPICH 2.1a 

MPI implementation 


C.20 Experiment: DVFS Effect 

This experiment is based on Algorithm 5 and was conducted 
using the TTCA synchronization. 


Parameter 

Values 

Details 

Experiment configuration - Figure 23 


V 

16 X 1 

Number of processes 

n 

30 

Number of experiments 

nrep 

1000 

Number of measurements per exper¬ 
iment 

msize 

26_2i3 Bytes 

Message size 

func 

MPI_A11reduce 

Benchmarked function 

win_size 

1 ms 

Window size 

N_FITPTS 

1000 

Number of fitpoints used to fit linear 
models 

N_EXCHANGES 

100 

Number of ping-pong messages ex¬ 
changed to obtain the difference be¬ 
tween local and reference times cor¬ 
responding to a single fit point 

MPI 

MVAPICH 2.1a, 
NEC MPI 1.2.11 

MPI implementation 


C.21 Experiment: Caching Effect 

This experiment is based on Algorithm 5 and was conducted 
using the TTCA synchronization. 


Parameter 

Values 

Details 

Experiment configuration - Figure 24 


V 

16 X 1 

Number of processes 

n 

10 

Number of experiments 

nrep 

1000 

Number of measurements per exper¬ 
iment 

msize 

26_2i3 Bytes 

Message size 

func 

MPI_A11reduce 

Benchmarked function 

win_size 

1 ms 

Window size 

N_FITPTS 

1000 

Number of fitpoints used to fit linear 
models 

N_EXCHANGES 

100 

Number of ping-pong messages ex¬ 
changed to obtain the difference be¬ 
tween local and reference times cor¬ 
responding to a single fit point 

MPI 

MVAPICH 2.1a 

MPI implementation 
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C.22 Experiment: Statistical Testing 

This experiment is based on Algorithm 5 and was conducted 
using the HCA synchronization. 


Parameter 

Values 

Details 

Experiment configuration - Figure 25, Figure 26, Figure 28 

V 

16 X 1 

Number of processes 

n 

30 

Number of experiments 

nrep 

1000 

Number of measurements per ex¬ 
periment 

msize 

1-2^^ Bytes 

Message size 

func 

MPI_A11reduce 

Benchmarked function 

win_size 

adapted for each 
message size 

Window size 

N_FITPTS 

1000 

Number of fitpoints used to fit lin¬ 
ear models 

N_EXCHANGES 

100 

Number of ping-pong messages ex¬ 
changed to obtain the difference 
between local and reference times 
corresponding to a single fit point 

MPI 

MVAPICH 2.1a, 
NEC MPI 1.2.11 

MPI implementation 


C.23 Experiment: Reproducibility Evaluation 

This experiment relies on the Intel MPI Benchmarks 4.0.2, 
SKaMPI 5 and Algorithm 5 (using the HCA synchronization 
method) to measure the run-time of MPl_Bcast. 


Parameter 

Values 

Details 

Algorithm parameters 


Experiment 

30 

Number of times the full experi¬ 

runs 


ment was repeated 

V 

16 X 1 

Number of processes 

msize 

1-2^^ Bytes 

Message size 

func 

MPI_Bcast 

Benchmarked function 

MPI 

MVAPICH 2.1a 

MPI implementation 

Experiment configuration - Figure 29(c) 


n 

30 

Number of mpiruns 

nrep 

1000 

Number of measurements per ex¬ 
periment 

win_size 

adapted for each 
message size 

Window size 

N_FITPTS 

1000 

Number of fitpoints used to fit lin¬ 
ear models 

N_EXCHANGES 

100 

Number of ping-pong messages ex¬ 
changed to obtain the difference 
between local and reference times 
corresponding to a single fit point 


C.24 Experiment: Run-time Comparison 


This experiment is based on Algorithm 5, which has been 
repeated for each synchronization method. 


Parameter 

Values 

Details 

Experiment configuration - Figure 44 


V 

32 X 16 

Number of processes 

n 

10 

Number of experiments 

msize 

1-2^^ Bytes 

Message size 

func 

MPI_A11reduce 

Benchmarked function 

nrep 

1000 

Number of measurements per exper¬ 
iment 

win_size 

150 ps 

Window size 

N FITPTS (JK, 

HCA) 

1000 

Number of fitpoints used to fit linear 
models 

N EXCHANGES 
QK, HCA) 

100 

Number of ping-pong messages ex¬ 
changed to obtain the difference be¬ 
tween local and reference times cor¬ 
responding to a single fit point 

MPI 

MVAPICH 2.1a 

MPI implementation 
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Appendix D 

Thematic Summary of Measurements 

D.1 Investigating the Round Trip Time (RTT) 
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Figure 30. Histogram of RTTs when sending 1000 ping-pong messages 
with 1 double (8 Bytes) payload between reference host (0) and 15 other 
hosts (1-15), (MVAPICH 2.1a, TUWien). 
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D.3 Investigating the Ciock Drift after Synchronization 

D.3.1 TUWien 


150 

O 

CO 

I 100 

o 

S-i 

a 

o 

^ 50 
q; 

a 

0 


Figure 33. Clock drift for 16 x 1 processes after 0, 1, 2, ..., 20s 
(distribution of maximum offsets over 10 calls to mpirun, MVAPICH 2.1a, 
TUWien, Exp. details: Appendix C.6). 
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D.3.2 Edel(G5k) 
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Figure 31. Mean RTT (after outlier removal) computed from 1000 ping- 
pongs between one pair of nodes (MVAPICH 2.1a, TUWien). 


Figure 34. Clock drift for 120 (15 x 8) processes after 0, 1, 2, ..., 10 s 
(distribution of maximum offsets over 10 calls to mpirun, MVAPICH 1.9, 
Edel (G5k), Exp. details: Appendix C.6). 


D.2 Investigating the Number of Invalid Measurement 
for Window-based Process Synchronization 



Figure 32. Percentage of incorrect measurements for MPI_Bcast, (8KiB, 
16 X 1 processes, 1000 measurements, 10 calls to mpirun, MVA¬ 
PICH 2.1a, TUWien). 
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Figure 35. Clock drift for 16 x 1 processes after 0, 1, 2, ..., 10s 
(1 call to mpirun, Intel MPI 4.1, Cartesius (SURFsara), Exp. details: 
Appendix C.6). 
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D.4 Investigating the Synchronization Efficiency 

D.4.1 TUWien 


Clock offset vs. sync duration (^litpoints, ^exchanges) 



Synchronization duration [s] 


Figure 36. Median clock offset (after 2 s) vs. synchronization phase 
duration for 16 x 1 processes (10 calls to mpirun, MVAPICH 2.1a, 
TUWien, Exp. details: Appendix C.8). 


D.4.2 Edel(G5k) 
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Figure 37. Median clock offset (after 5 s) vs. synchronization phase 
duration for 120 processes (15 x 8) (10 calls to mpirun, MVAPICFI 1.9, 
Edel (G5k), Exp. details: Appendix C.8). 

D.5 Measuring Drift over Time 

D.5.1 Edel(G5k) 
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Figure 39. Drifting run-times of MPi_Scan (median of 3 experiments, 
8KiB, 16 X 1 processes, 4000 runs, bin size 100, window size: 500ps, 
Intel MPI 4.1, Cartesius (SURFsara), Exp. details: Appendix C.17). 
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Figure 40. Drifting run-times of MPi_Aiireduce (median of 10 exper¬ 
iments, IKiB, 16 X 1 processes, 4000 runs, bin size 100, window size: 
1ms, MVAPICFI 2.0a-qlc, VSC-3, Exp. details: Appendix 6 . 17 ). 
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Figure 38. Drifting run-times of MPi_Aiireduce (median of 10 experi¬ 
ments, 32KiB, 15 X 8 processes, 4000 runs, bin size 100, window size: 
500 ps, MVAPICFI 1.9, Edel (G5k), Exp. details: Appendix C.17). 
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D.6 Distribution of Measured Run-times D.6.3 VSC-3 
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Figure 41. Distribution of run-times and corresponding autocorrelation 
plots (16 X 8 processes, 4000 runs, FICA synchronization, window size: 
500 ps, MVAPICFI 1.9, Edel (G5k), Exp. details: Appendix C.14). 
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Figure 43. Various distributions of run-times and corresponding autocor¬ 
relation plots for several calls to mpirun (16 x 1 processes, 1000 runs, 
FICA synchronization, window size: 1ms, MVAPICFI 2.0a-qlc, VSC-3, 
Exp. details: Appendix C.14). 


Figure 42. Distribution of run-times and corresponding autocorrelation 
plots (16 X 1 processes, 4000 runs, HCA synchronization, window 
size: 500ps, Intel MPI 4.1, Cartesius (SURFsara), Exp. details: Ap¬ 
pendix C.14). 
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D.7 Comparing Run-times measured using 
MPi_Barrier or Window-based Synchronization 


MPI_Bcast, 32x16 processes, 10 mpiruns 
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Figure 44. Distribution of normalized median run-times reported for 
MPi_Bcast with various message sizes, for the window-based syn¬ 
chronization method HCA (window size: 150 ps) and MPi_Barrier, (10 
calls to mpirun, 32 X 16 processes, 1000 runs, MVAPICH 2.1a, TUWien, 
Exp. details: Appendix C.24). 



